Grasping Technology: 2017

Sunday, December 31, 2017

Keystone Identity Service on OpenStack: Database url issue

I remember one time trying to debug a problem with the setup of keystone.

It took me FOREVER to figure it out.

It was when I was, as part of set up procedure, running the script:
su -s /bin/sh -c "keystone-manage db_sync" keystone

This script will exit silently with a "1" code if it does not run. You MUST check the $? in bash to make sure the damned thing ran.

When I saw the "1" code, I went and checked the keystone log, which said:

2017-12-31 23:28:21.807 13029 CRITICAL keystone [-] NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:mysql.pymsql
2017-12-31 23:28:21.807 13029 ERROR keystone Traceback (most recent call last):
2017-12-31 23:28:21.807 13029 ERROR keystone File "/bin/keystone-manage", line 10, in <module>
2017-12-31 23:28:21.807 13029 ERROR keystone sys.exit(main())
2017-12-31 23:28:21.807 13029 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/manage.py", line 44, in main
2017-12-31 23:28:21.807 13029 ERROR keystone cli.main(argv=sys.argv, config_files=config_files)
2017-12-31 23:28:21.807 13029 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/cli.py", line 1312, in main
...

I started looking at all of these packages I'd installed, checking them (they were all there). I then went in search of help on google. And yes, the message was there, but no help fixing it.

I then realized...the URL was wrong in the file. Problem is, the naked eye can't seem to spot this error:

Incorrect URL:
connection = mysql+pymsql://keystone:KEYSTONE_DBPASS@controller/keystone

Correct URL:
connection = mysql+pymysql://keystone:KEYSTONE_DBPASS@controller/keystone

Optically, very hard to see that missing second "y" in "mysql" because of the "py". Fortunately, I lost only 30 minutes on this issue this time. Earlier, when installing Newton, I lost an entire day or more.

Tuesday, December 19, 2017

Port Binding Failures on OpenStack - How I fixed this

In trying to set up OpenStack for a colleague, I had an issue where I could not get the ports to come up. The port status for the networks would be in a state of "down", and I could not ping these ports from within the network namespaces, or from the router ports.

Unfortunately, I did not have DEBUG turned on in the logs. So I saw no output from nova or neutron about any kind of issues with networking or port bindings.

I did enable DEBUG, at which point I started to see "DEBUG" messages (not "ERROR" messages mind you, but "DEBUG" messages) about port binding failures. Finding these port binding debug messages was like looking for a needle in a haystack as there is a ton of debug output when you enable debug in Nova and Neutron.

I had a very very difficult time figuring out what was causing this. But here is how I fixed it:

1. I watched a segment on YouTube about ml2 and Neutron. This was an Auction OpenStack Summit, and the url is here:

https://www.youtube.com/watch?v=e38XM-QaA5Q

2. I quickly realized that host names were such an integral part of port binding, that it was necessary to check the agents, the host names of those agents in Neutron, and the host names stored in mysql.

In MySQL, the neutron database has a table called agents, and all agents are mapped to a host. That host needs to be correct and resolvable.

In the end, I wound up deleting some old agents that were no longer being used (old hosts, and some openvswitch agents that were lingering from a previous switch from linuxbridge to openvswitch). I then had to correct some hostnames because I had my OpenStack Controller and Network node in a VM and I had recycled that VM for my colleague - who had reassigned a new hostname for the VM on his platform.

Then, just to be thorough, I deleted all agents (i.e. DHCP agents), then deleted all subnets, then deleted all networks. I then re-created these - WITH NEW NAMES (so as to ensure that OpenStack wasn't re-using old ones); in order. First I created the networks, then I created the subnets, then I created the agents (which generally create the ports themselves). Lastly, I mapped these new subnets to the router as interfaces (which creates ports).

One thing that is EXTREMELY important, is that ports bind to the PHYSICAL network...not the Virtual network.

If you create a external provider network called "provider", and the physical network is called "physical", and then you go into ml2_conf.ini and linuxbridge.ini and use "provider" instead of "physical" in your bindings, you will most assuredly result in a port binding failure.

So these are the tips and tricks to solving the port binding issue, or configuring properly ahead of time so that you don't run into the port binding issue.

Wednesday, December 13, 2017

OpenStack Compute Node - state path and lock file directories

I may have posted on this topic before, because I ran into this issue before.

I was setting up OpenStack for a colleague of mine, and had all sorts of issues getting it to work.

A couple of problems I had were related to services that were not enabled, and when you rebooted the unit(s), the services were not starting up. These were easy to fix.

The difficult issue to find and fix - which took me almost a full business day - had to do with how OpenStack Nova configures itself.

In the /etc/nova.conf file, there is a variable called a state_path. This variable is set to /var/lib/nova - a directory nova creates upon installation and sets permissions to the nova user and group.

In this directory, is a subdirectory called "instances", where Nova puts running instances.

The problem, is that Nova on installation does not seem to check or care about partition and file system sizes. It just assumes.

The issue we had, was that on a CentOS7 default installation, the /var directory is part of the root file system, which is very small (15-20 Gb), as it should normally be (you generally separate root file system from apps and data).

When you would start Nova, even in debug mode, you never saw an ERROR about the fact that Nova had issues with any of its filters (disk, ram, compute, et al). They were being written into the log as DEBUG and WARNING. This made finding the problem like finding a needle in a haystack. And you only saw this evidence after enabling debug in the /etc/nova.conf file.

Eventually, after enabling debug and combing through the logs (on both Controller as well as Compute node), we found a message on the Controller node (NOT THE COMPUTE NODE WHERE YOU WOULD EXPECT IT TO BE) about the disk filter returning 0/1 hosts.

So - we moved the /var/lib/nova to the /home/nova directory (which had hundreds of Gb). We also changed the home directory of nova in /etc/passwd to /home/nova (from /var/lib/nova).

We got further...but it was STILL FAILING.

Further debugging indicated that when we moved the directory, we forgot about another variable in /etc/nova.conf called the lock_file_path. Used for RabbitMQ communication, this variable was still pointing to a lock file in /var/lib directory (that had been moved to /home). This caused Compute Filter issues - also showing up as DEBUG and WARNING messages and not errors.

Ubiquiti EdgeRouter X - Power Over Ethernet (POE)

I tested the POE feature of the router this week. I will cover what I did, and the results.

I had a Polycom VOIP phone that was POE-enabled, and I did not have the power supply for it; hence the use case, and excitement about leveraging this feature when I started using this router.

By default, this router uses eth0 as the Management port (192.168.1.0/24), and eth1 as the WAN / Internet port.

This assumes that only a single ISP is being used with the router which is a good assumption considering this is a small consumer appliance and not an Enterprise router. The router actually supports dual ISPs (link aggregation), which is certainly an Enterprise feature, but to employ this feature you need to change the ports such that you are using eth0 (leftmost port) and eth4 (rightmost port).

The POE feature on this router is POE Passthrough, which means that you need to insert a POE-enabled device into the "input port" - which happens to be eth0 (leftmost port), and POE output will be supplied on the passthrough port eth4.

First, I checked the line to make sure it had POE, by connecting it to the phone directly. Voila', it powered up. So, knowing POE was indeed enabled on the input CAT-5 from the patch panel, I set about trying it on the router.
Next, I had to reconfigure the router so that eth0 was the WAN port. This meant removing (or moving) the Management port. Since a Management port is handy to have, I made eth1 the Management Port - essentially switching eth0 and eth1. I will skip the details on how to do this, but basically you can do this through the menu system of the device by going to Dashboard, Services and Firewall.
I then connected the POE VOIP phone to eth4 - directly. A common mistake is to use a switch, but if the switch is not a POE switch, this won't work (you can also burn up the switch this way). I made sure that eth4 was on the Switch! For some reason, only eth2 and eth3 were on the switch (boxes ticked). I ticked the eth4 box.
On the eth4 POE Output interface is where you enable POE. I was surprised you could not enable it on eth0 input link. I enabled this. Keep in mind the router is STILL PLUGGED IN!

The POE VOIP phone did not power up. One theory is that the router could not power up the phone on its own, based on just the voltage from the input CAT-5 (keep in mind it also needs to power the router itself). We then powered up the router with the adaptor, thinking this might supply more juice (voltage) to power up the phone. Nope.

I did not have more time to debug this. I disabled POE on eth4, and used a POE Adaptor, which works just fine. Maybe I can attempt this later, but at first test, POE did work. For me. For my phone. This is the first time I have actually used POE.

Thursday, December 7, 2017

OpenVSwitch Round III

Turns out that when I rebooted my boxes, nothing worked. I debugged the issue back to OpenVSwitch.

The problem is definitely the provider bridge on the Controller / Network node.

I *think* the problem is that I am using only a single interface on the Virtual Machine, and using that same interface for two purposes:

1. vxlan tunneling to OpenStack virtual machines (via the br-tun OpenVSwitch managed bridge.)
2. internet connectivity (via the br-provider OpenVSwitch managed bridge)

I have to verify this, but that is the hunch.

I have backed out OpenVSwitch once again, reverted back to linuxbridge-agent, and more research to follow.

NOTE: To use two interfaces on the VM means quite a bit of work. The interface ens3 on the VM connects to a bridge that connects to an adaptor on the host. I probably need to use two adaptors on the host, and I probably need additional network(s) created, which has routing implications.

Essentially it's a network redesign - or, we could say, expansion.

What to do if your VM won't boot

After I made a bunch of changes to OpenVSwitch (see last post), I rebooted my VM and it just hung with a "random init crng" message. The last thing it did was bring up a network interface ens3 before this message appeared. I assumed it was trying to start networking, and was waiting for a timeout.

I waited for what seemed like an eternity...5 minutes? 10? Longer? And a prompt came up. I saw little to nothing in the logs, so I became concerned. How would I debug this if I have to wait that long every time I reboot?

I found a few nifty tricks to get you out of a jam if you're virtual machine won't boot.

First, if the VM will come up at all, you can enable "system-debug" with systemctl:
# systemctl enable system-debug

This service allows you to connect on tty9 to the VM as it is booting. On a virtual machine, you typically need to "send" a command to get to this terminal. So on KVM, there is a "Send Command" option and you can send keystroke commands to the VM (i.e. Ctl-Alt-Backspace, Ctl-Alt-F1...F9, et al). This proved to be quite handy, and a sense of relief that I could actually get inside of the VM - especially if you have not snapshotted it or backed it up.

If you need to do more, or go further, you may need to interrupt the boot process and edit the line in the grub config file. For Red Hat users, here is a link to a page that gives some direction on how to get into debug mode for a virtual machine.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-terminal_menu_editing_during_boot

Tuesday, December 5, 2017

OpenStack and OpenVSwitch - Round II

Yesterday I installed OpenVSwitch in conjunction with OpenStack and was excited to see that it worked - first time - after I installed and configured it.

To review, there are two compute nodes. These are NOT virtual machines; they are kvm / libvirtd machines (each kvm host is a compute node). And there is one OpenStack Controller that IS a virtual machine, and this virtual machine also serves as the OpenStack Network node (it is common to see the Controller and Network node separated, but I run them on the same VM).

The two OpenStack Compute Nodes appear to be working perfectly. Each of these has two bridges:
1. br-tun - used for tunneling to the controller
2. br-int - used as an integration hub (virtual patch cords between bridges)

The Controller node, however, is a bit different in that it needs to have an additional bridge called br-provider. This bridge connects out to the internet.

I am taking this architecture STRAIGHT out of the OpenStack documentation (I am running Newton but these diagrams are also identical and present in the Ocata release of OpenStack), which is at this website: https://docs.openstack.org/newton/networking-guide/deploy-ovs-selfservice.html

Below is a diagram - their diagram - shown here for convenience - to illustrate what I am trying to achieve:

Now in the OpenStack Install Guide, for Newton, they don't have you creating this "Network Node". All of the stuff on a Network Node is on the Controller Node. So the Network Node and Controller node are combined. While I am considering breaking the Network Node out into a separate VM or on a separate small box, currently my Controller and Network Node are all on the same CentOS 7 virtual machine running on a libvirtd on one of the Compute Nodes - external to OpenStack though.

Having it this way can be difficult in some ways. For instance, I have to run an OpenVSwitch Agent on this Network Node - and this is not well-documented (documentation leads you to believe you only need the linuxbridge agent or openvswitch agent on the Compute Nodes). Another issue with this is that there are ports (tap) for every dhcp agent, and every instance, which gets confusing when you look at your OpenVSwitch and see all of these, with these cryptic names beginning in qr, qq, etc. You don't always know which is which and where to start debugging.

Anyway...I had a VERY difficult time getting OpenVSwitch to work. The first issue I had was that the bridges in OpenVSwitch did not seem to have the right interfaces. This was a configuration issue. Another issue had to do with the br-provider bridge. I actually thought I should put those on the Compute Nodes even though the diagram did not have those. I didn't understand why the Compute Nodes could not go straight out - and why they needed to be tunneled over to the Controller / Network nodes, which seemed like a huge unnecessary performance hit.

But the MAIN issue I had was that once configured and after I instantiated virtual machine instances, they would come up but without an IP Address. I set the instantiated VM's IP Address statically, and lo and behold, it could communicate just fine. This told me that the issue was DHCP; that the VMs were not getting an IP Address via DHCP. At least I knew, but this concerned me because it did not seem like it would be easy to debug or fix (it wasn't).

2. I checked the dhcp agent services. They were running. I checked the logs. They looked okay. I also did the same for the l3 agent.

3. I then mentioned this to my boss, and he suggested that the OpenVSwitch was not handling broadcast messages. He suggested I trace the broadcast by checking the endpoints and working in from there.

This led to me having to go into the namespaces with "ip netns exec". What I found was that the interfaces inside of the namespaces had "@" characters in the names, making it difficult to run tcpdump.

I finally ran across the ovs-testing package, which has some utilities such as ovs-tcpdump that help debug bridges and interfaces inside the switch. When I did this, I found the broadcast traffic reaching the OpenStack Controller.

Finally - I realized that I needed to DELETE the dhcp agents and RECREATE them. The original agents were created when the networks were created, using the linuxbridge. Apparently switching to OpenVSwitch rendered these agents inoperable. They could not "adapt" to that change. So - I deleted those dhcp agents, OpenStack recreated them, and now things are working just fine.

Monday, November 27, 2017

Elasticity and Autoscaling - More Testing

Today I went back and did some further testing on AutoScaling.

Here are some new things that I tested:

1. Two Scaling policies in a single descriptor.

It does no good to have "just" a Scale Out, if you don't have a corresponding "Scale In"!

You cannot have true Elasticity without the expansion and contraction - obviously - right?

So I did this, and this parsed just fine, as it should have.

I also learned you can put these scaling directives in different levels of descriptors - like the NSD. If you do this, I presume that what will happen is that it will factor in all instances across VNFMs. But I did not test this.

2. I tested to make sure that the scaling MAXED OUT where it should.

If cumulative average CPU across instances was greater then 35% CPU, then the SCALE_OUT 3 would take affect. This seemed to work. I started with 2 instances, and as I added load to CPUs to boost the cum average up, it would scale out 3 - and then scale out 3 more for a total of 8 no matter what load was on the CPUs. So it maxed out at 8 and stayed put. This test passed.

I was curious to see if the engine would instantiate one VM at a time, or would it instantiate in bunches of 3 (per the descriptor), or would it actually just instantiate up to the max (which would be errant behavior). Nova in OpenStack staggers the instantiations so it APPEARS to look like it is doing one at a time up to three (i.e. 1-1-1), at which point re-processing may kick off another series of 1-1-1. So this is probably to-be-expected behavior. The devil is in the details when it comes to the Orchestrator, OpenStack, and the OpenStack Nova API in terms of whether it is possible and to what extent you can instantiate VMs simultaneously.

When a new VM comes up, it takes a while for it to participate in measurements. The scaling engine would actually skip the interval due to a "measurements received less than measurements requested" exception and only start evaluating things until and unless it had all of the VMs reporting in measurements that were expected. I have to think about whether I like this or not.

3. Elasticity Contraction - by using SCALE_IN_TO parameter.

I set things up so that it would scale in to 2 instances - to ensure at least two instances would always be running. I would do this when cumulative average CPU was less than 15% CPU across instances.

This test, actually, failed. I saw the alarm get generated, and I saw the engine attempting to scale in, but some kind of decision-making policy was rejecting the scale in "because conditions are not met".

We will need to go into the code and debug this, and see what is going on.

Thursday, November 23, 2017

Ubiquiti Edge Router ER-X - Impressive

I just love this router.

The icon on the top that shows colorized ethernet plugs (colorization related to status). Cool.
It was sooooo easy to configure it.
It has a shell that takes you into Ubuntu Linux (uses Ubuntu = impressive). I'm not even sure it is using BusyBox or some slimmed down quasi-linux. It looks like a custom compile of Ubuntu.
One cool feature is that it can support link aggregation. I am not using that feature, but it's cool.
Has excellent support for IPv6.

It can also automatically switch the ports on the router so that you don't need a L3 switch to go with it.

So for example, you can set up:

eth0 as the management port
eth1 as the WAN port, and
eth2, eth3 and eth4 are switched so that anything plugged into these are on the same network (you defined the network).

I don't currently have this router configured a router than actually learns routes. In other words, I am not running BGP, RIP or OSPF on it. I don't really need to learn networks, nor advertise networks, dynamically. All I need is to get out to the internet.

I have it set up with a hairpin NAT, and the firewall rules configured on it are rather trivial at the moment but designed to protect ingress through iptables rules.

This is truly a "power user" routing device, and it can fit into the palm of your hand; it is no bigger than a Raspberry Pi device.

This router also comes with some interesting Wizards that allow you to configure the router for certain use cases, like the WAN+LAN wizard.

So I have not done anything in-depth, but I spent an hour messing around with this device and I'm pretty impressed with it.

Security - Antivirus specifically

My McAfee just expired on this computer and I am now getting a bunch of intrusive "buy me" pop-ups. I have never thought McAfee to be top of the line when it comes to Anti-Virus, but the question is, is anybody really stopping viri these days?

I have started to get smarter about Security. I went to RSA in 2016, and I bought a book on Exploits. This is very very hardcode book, and I have not managed to get through it all yet. It requires Assembler and C programming, and teaches you how hackers actually exploit code. I think once I finish this it will be awesome knowledge, and I am about halfway through it. I got pulled off of this due to the longer hours at work playing with virtualization and orchestration.

So - I am not current on malware. So I spent some time looking around this morning, reading anti-virus reviews.

It does not appear that there is much out there in the way of OpenSource AV. ClamAV looks like the only thing actively maintained. This is a bit of a surprise.

There are some free packages out there, but I am sure they probably nag you incessantly to buy or upgrade. The big question is this: Can you really trust FREE?

I also see some interesting Cloud-based packages out there that are working from outside your network. This would have been an absolute no-no for me in earlier times, but considering the danger of today's malware, maybe this kind of approach is worth re-examining, is good results are coming from it. One such company is Crystal Security.

I see some products like VoodooShield. And some new ones I had not previously encountered like GlarySoft Malware Hunter.

Of course, Kaspersky, ESET - these guys always get good reviews.

It is probably good to stay up to speed on this stuff. To take an hour here and there and stay current.

OpenBaton Fault Management and AutoScaling

It has been a while since I have taken any notes or blogged anything. That doesn't mean I haven't been doing anything, though. 😎

Over the last month or so I have been testing some of the more advanced features of OpenBaton.
- Fault Management
- Auto Scaling
- Network Slicing

These have taken time to test. I was informed by the development team that the "release" code was not suitable for the kind of rigorous testing I planned to do, and that I needed to use the development branch.

This led me down a road of having to familiarize myself with the "git" software management utility. I know git has been around for a while and silently crept in as almost a de-facto standard for code repository and source management. In many shops it has replaced the classic stalwart Clearcase, CVS, SVN and other software that has actually been in use for decades. Even in my own company's shop, they brought in a "git guy", and of course since that is the recipe he cooks, we now use that. But up to this point, I had not really had a need to do more than "git clone". Now - I am having to work with different branches, and as I said, this took some time. Git is fairly simple if you do simple things, but it is far more complex "under the hood" than it looks, especially if you are doing non-simple things with it. I could do a post just on git alone. I'm not an authority on it, but have picked up a few things - including opinions - on it (some favorable, some not).

The first thing I tested was Fault Management (FM). Fault Management is essentially the ability to identify faults and trigger actions around those faults. The actions can be an attempt to heal - or it can be an attempt to Scale, or it can be an attempt to fail-over based on a configured redundancy mechanism. The ETSI standard descriptors allow you to specify all of this. The interesting thing about FM in a virtualized context is that it gets into the "philosophy" of whether it makes sense to spend effort healing something, as opposed to just killing it and re-instantiating it. This is called the "Cattle vs Pets" argument. I think there ARE cases where you need to fix and heal VMs (Pets), but in most cases I think VMs can be treated as Cattle. When VMs are treated as Pets, the nodes are generally going to be important (i.e. they manage something important, as in a control plane or signaling plane element), and cannot just be taken down and re-instantiated due to load or function.

I then tested AutoScaling - or, using a better term, Elasticity. This allows a virtualized network to expand and contract based on real-time utilization. This feature took me a while to get working due to having to compile different modules of code from different moving-target git branches over and over until I could finally get some code that wanted to work, with slight modifications and patches. When I finally got this working, it was super cool to see this work. I could do a separate post on this feature alone. After I got it working I wound up helping some other guys in a German network integration company get the feature working.

Network Slicing has been more difficult to get to work. That is probably a separate post altogether, related and intertwined with QoS such topics.

Thursday, September 21, 2017

OpenStack - Two Compute Nodes

Getting two Compute nodes to work was fairly straightforward.

You basically just install openstack-nova-compute, and your Neutron network plugin (linuxbridge-agent in my case).

The only question I had was whether two Compute Nodes can belong to the same OpenStack Region.

Thank goodness I found a ppt where a guy made it clear that one could run a slew of nodes in a single region (he had multiples in Region 1, and Region 2).

At one point, I decided I would install the OpenVSwitch on this second Compute Node. I'll probably write a second post on that. It did not appear to me that you could mix and match OpenVSwitch and LinuxBridge on different Compute Nodes (at least not easily?). This is because the Neutron L3 Agent config file has a driver field and only seems to accept one mode or the other. I could be wrong about this; more testing necessary. But I backed OpenVSwitch out and enabled LinuxBridge-Agent. Things seem to be working very well with the Linux Bridge Agent.

The Linux Bridge Agent creates Layer 2 Tap interfaces and puts these interfaces on a bridge. If you are using VXLAN protocol it will also manage those interfaces as well.

OpenVSwitch

Today I added a 2nd Compute Node (KVM).

I thought I would use OpenVSwitch on it.

This took me down a deep rabbit hole, as OpenVSwitch is a complex little bugger.

I installed the OpenVSwitch package, then the driver agent (on Compute Node). I wanted it to run in a Layer 2 mode because I had LinuxBridge Agent running on the first Compute Node and the Controller.

After setting OpenVSwitch up on the 2nd Compute node, I realized my external NIC was a bridge, so I tried to use veth pairs to make it work. Nope. As it turns out, the Controller (and L3 agent) seems to use drivers for OpenVSwitch OR LinuxBridge (not both). It appears that it is all or nothing and you cannot mix and match between LinuxBridgeAgent and OpenVSwitchAgent.

I backed it out and used / installed LinuxBridgeAgent.

OpenStack Functional Demo

Originally, with one CentOS 7 server (32 Gb RAM) that was set up to run Ansible and LibvirtD at my disposal, I installed Openstack on a single box.

I put the Controller in a VM and used the host as the Nova Compute Node.

I had all sorts of issues initially. The Keystone and Glance were fairly straightforward. I did not have DNS, so I used IP addresses for most urls, which is a double-edged sword. The complexity in OpenStack is with Nova (virtualization management) and Neutron (networking).

I did not create a "Network Node". I used only a Controller Node and a Compute Node. What one would normally put on a Network Node, runs on the Controller Node (L3 agent, DHCP Agent, Metadta Agent).

One issue was libguestfs was not working. I finally removed it from the box only to realizs that there was a yum dependency with the openstack-nova-conpute package. So I installed nova compute using an rpm with the --nodeps flag.

Getting linuxbridge agent to work took some fiddling. One issue is that it was not clear if I needed fo run LinuxBridgeAgent on the Controller. The instructions make it seem that it is only for the Conpute Node. Well, not so. Neutron creates a tap for every dhcp agent, and every port. ON THE CONTROLLER if that is where you run those services. So you install it both places.

The Neutron configuration file...is about 10,000 lines long, leaving many opportunities for misconfiguration (by omission, incorrect assumption/interpretation, or just plain typos). It took a while to sleuth out how OpenStack uses Nova, Neutron and the l3 agent and linuxbridge agent to create bridges, vnets and taps (ports). But - confusing again - is whether you need to configure all parms exactly same on both boxes, of if some are ignored on one node or the other. I was not impressed with these old style ini and config files. Nightmares of complexity.

Another major challenge I had was the external network. I failed to realize (until I did network debugging) that packets that leave the confines of OpenStack need to find their way back into OpenStack. This means having specific routes to internal OpenStacks networks via the OpenStack external gateway port on the OpenStack router from VMs sitting outside OpenStack.

Another confusing thing is that OpenStack runs namespaces (separate and distinct network stacks) to avoid IP Overlays (by default - the way Neutron is configured). Knowing how to navigate namespaces is / was a new topic for me and makes it harder to debug connectivity issues.

Finally, when I worked all of this out, I realized that the deployment of VMs was taking up almost 100% CPU. This led me down a rabbit hole to discover that I needed to use the kvm virt_type, and a CPU mode of host-passthrough to calm the box down.

Once I got this done, I could deploy efficiently.

Another thing (maybe this should be its own post) is the notion of setting ports that you can use on deployment (instead of saying "deploy to this network", you can say "use this port on this network" - which has its own IP and port assignment). Because you can attach multiple submets to a single network, I figured I could create ports for nodes that I wanted to reside on that submet. And I COULD! But - the ETSI MANO standards have not caught up with this kind of cardinality / association (per my testing anyway) so it only works if you use OpenStack GUI to deploy. Therefore, having a "one subnet to one network" rule is simpler and will work better for most situations I think.

In the end, I was able to do everything smoothly with, OpenStack. Save Images, create Flavors, Networks, and Deploy. But it all has to be configured "just so".

Sunday, September 17, 2017

Service Orchestration and Automation with Open Baton

I haven't blogged in a while, but I've been busy working on Cloud Automation technology of late.

Originally, some guys in the company did an evaluation between Puppet, Chef and Ansible for automating the deployment of virtual machines into the Cloud (they hosted their own cloud and did not rely on the commercial cloud providers we see today).

It took me a while, but I finally had the time to examine their stuff, and before long I was hacking the scripts for my own purposes, so that I could build different versions of our SD-WAN solution, and different topologies of this solution (e.g. we had an L2 solution, an L3 solution, an L3 solution with Routing, et al). I fell in love with Ansible. I could spin a virtual network up in a matter of minutes, and I could start with raw virtual machines (Linux - CentOS) that would download the packages and install them (with yum installer), install and configure the software, etc. I could probably write a book on the topic of Ansible alone. But - I took someone else's hard work, and ran with the ball and it is always easier to do that then start from scratch yourself.

Then - I was asked to get a prototype of ETSI Mano working.

Years back, when I was at Nokia Networks, we examined Service Orchestration, but back then there were no standards and it was a HUGE integration clusterf$k to get that kind of technology working. We tried it with BEA and JNetX, and message queues. It was a mess.

This time, I read through the standards, and indeed, it looked to me like we HAVE standards drafted up. But - do we have any working solutions? I looked at a solution called OpenBaton, which is open source, out of Berlin, Germany. I put it on a box, went through the tutorials, and it seemed to "kinda sorta" work. So I was able to get this working with a stub "dummy" module that doesn't do anything.

Originally, I put OpenBaton on one virtual machine. It is designed to run on Ubuntu 14.04 (at least that is what the developers tested it on). So, not being heavily familiar with Ubuntu I installed 14.04 on a virtual machine on a KVM Host (32 Gig RAM, 8 core CPU and 1Tb Disk) and immediately the Ubuntu upgraded it to 16.04. This created some problems right away. One HUGE issue is that all of the software is written in Java, and they stated that they wanted JDK 1.7. But - guess what? Oracle had just deprecated 1.7 that very week, and took the 1.7 JDK link down, which broke all of Open Baton's scripts. Don't ask me how I got around this...it was very difficult. I installed OpenJDK 1.7, and then "faked things" so that Open Baton's scripts would believe Oracle JDK was on the box. I wound up having to download and compile many packages from GitHub and compiling them myself. I also wound up having to hack and manipulate the SystemD unit files so that the services would start up properly.

Initially, I only installed the Orchestrator (NFVO), and the Generic VNFM (Virtual Networking Function Manager) modules. But, to really vet the technology out, OpenBaton needs a "real" system to talk to. So, in a 2nd Virtual Machine on the KVM host, I installed an OpenStack Controller Virtual Machine on Centos 7, and it ran along side the OpenBaton Virtual Machine. On the KVM Host, I installed the Neutron Compute Module, which is responsible for interacting with the KVM host and launching the virtual machines.

I got it to launch machines, but that gets boring quick. I wanted to examine the ability to run scripts and configure the VMs dynamically, and have the VMs inter-communicate. I then learned that OpenBaton - though called an Orchestrator - cannot actually pull any of this off without using an EMS (Element Management System). And Open Baton uses Zabbix for this. So, I had to install a Zabbix Server and a Zabbix Plugin - and I installed these on the Open Baton Virtual Machine, thinking that I would alleviate issues if I put them all together on the same box (more on this later).

In the end, I am able to get Open Baton to launch VMs consistently, but I get a TON of timeout errors. As I debug things, I realize that threads and message queues are timing out because the process of deploying and configuring the VMs is so CPU and Disk intensive that the VMs just get overwhelmed and Open Baton gets impatient waiting for things to happen.

I run top (as well as htop and other tools) and realize that I need to take a step back if I am going to take a step forward. I need to get another box - a second box - and distribute some load, and move some things out of these virtual machines.

Okay, that's it for now. I will update more on the next post.

Wednesday, July 26, 2017

NetFlow with Ntop

I had heard that Ntop supports Netflow on Linux.

I found a link / blog where someone else has played with this package for same or similar purposes. Let me share that here:
https://devops.profitbricks.com/tutorials/install-ntopng-network-traffic-monitoring-tool-on-centos-7/

I downloaded the Ntop package, and immediately it barked about the fact that I did not have kernel headers on the system.

This is bad, in my mind.

What box, running out in the field, would have kernel headers installed on it? That would be a bad security practice because it would mean that the box has a lot of stuff on it that it probably shouldn't have...specifically this would mean compilers, et al?

I also noticed that the package runs with a license code. There is a limited license it can run as, which is default configured. But I'm not sure I like having software, at least for this purpose, that is dependent on licensing. I did not study whether it is a key license that is time expired, or if it calls out to a remote server to authenticate the license, et al.

I kind of stopped there. I did not play with it any further. I may come back to it, and if I do I will update this accordingly.

Saturday, July 22, 2017

NetFlow with nfcapd and fprobe

I spent some time researching and using NetFlow this week (about a day).

Basically, you download the nfdump package, which has the collector (nfcapd), and a GUI (nfsen) and a command line tool called nfdump.

You run the collector, which listens on a standard or specified port, and "something" (i.e. a router) that knows how to capture flows, will write netflow formatted files. Then you can use nfdump or nfsen to view these flows.

There are multiple versions of NetFlow - from version 5 all the way up to 9 (see the NetFlow Wiki). The different versions provide additional data (or extensions as they refer to them).

The tricky part in testing this is to mimic or simulate a router. To do this:

fprobe is a tool you can install to generate flows. But it does not appear to install with the yum package manager, so you need to download the source and compile it, or there is an rpm that can be downloaded and installed.

frpobe-ulog is another tool, but it runs over iptables and requires iptables rules to work. I was surprised to see that yum COULD find and install this program, but not fprobe.

There are a few other tools as well, but these were the two I tried out.

Both of these worked, although there is not a lot of documentation or forum discussion on the fprobe-ulog approach. I wound up using fprobe.

There is the question of what defines and constitutes a network flow. The Wikipedia defines this. I think that if you have a bunch of udp traffic, it is harder for Netflow to stitch the traffic together into a flow for hindsight analysis. But TCP of course is straightforward.

System Tap

I spent some time reading the Beginner's Guide to System Tap.

https://sourceware.org/systemtap/SystemTap_Beginners_Guide/

I learned the basics of writing, reading and compiling / running the system tap scripts.

I also enjoyed running the sample System Tap scripts that are mentioned specifically in Chapter 5 of this guide.

I wound up downloading a bunch of these; especially the networking ones.

Writing these efficiently would take some practice. But there is a good Reference Guide that can make the process of writing the scripts easier. The question is, could there be a use case for writing one of these scripts that someone hasn't already thought up, and written?

Wednesday, July 19, 2017

Ansible Part II

I've had more time to play with Ansible, in the background. A little bit at least. Baby steps.

I use it now to deploy SD-WAN networks, which have different types of KVM-based network elements that need to be configured differently on individual virtual machines.

I enhanced it a bit to deploy virtual-machine based routers (Quagga), as I was building a number of routing scenarios on the same KVM host.

I have made some changes to make Ansible work more to my liking:

1. Every VM gets a management adaptor that connects to a default network.
2. The default network is a NAT network that has its own subnet mask and ip range.
3. I assign each VM an IP on this management network in the hosts file on the KVM host.

The ansible launch-vm script uses the getent package to figure out which IP address the VM has by its name, which is defined in the inventory file.

Because the adaptor type I like to use is Realtek, I had to change guestfish in the launch-vm script to use adaptor name ens3. I also had to change it to use an external DNS Server, because the lack of a DNS server was causing some serious issues with the playbooks not running correctly; especially when they needed to locate a host by name (i.e. to do a yum install).

This ansible has turned out to be very convenient. I can deploy VMs lickety split now, freeing up the time I would normally spend tweaking and configuring individual VM instances.

I'm thinking of writing my own Ansible module for Quagga set up and configuration. That might be a project I get into.

Before I do that, I may enhance the playbooks a bit, adding some "when" clauses and things like that. So far everything I have done has been pretty vanilla.

Wednesday, July 5, 2017

Quagga Routing - OSPF and BGP

For the last month or two, I have been "learning by doing" routing, using Quagga.

Quagga uses an abstraction layer called Zebra that sits (architecturally not literally) on top of the various routing protocols that it supports (OSPF, BGP, RIP, et al).

I designed two geographically-separated cluster of OSPF routers - area 0.0.0.1 - and then joined them with a "backbone" of two OSPF routers in area 0.0.0.0. I hosted these on virtual machines running on a KVM host.

From there, I modified the architecture to use BGP to another colleague's KVM virtual machine host that ran BGP.

Some things we learned from this exercise:
1. We learned how to configure the routers using vtysh,
2. We had to make firewall accomodations for OSPF (which uses multicast) and BGP.

We also used an in-house code project that tunnels Quagga. I have not examined the source for this, but that worked also, and required us to make specific firewall changes to allow tunx interfaces.

Percona XtraDB Cluster High Availability Training

Tuesday 5/23 through 5/26 I attended Percona Cluster High Availability training.

The instructor was very knowledgeable and experienced.

One of the topics we covered was the "pluggable", or module-based, architecture that allow different kinds / types of engines that can be used with the MySQL database. He mentioned InnoDB, and how the Percona XtraDB is based off of the InnoDB engine.

We also covered tools and utilities, not only from MySQL, but 3rd Parties, including Percona (Percona Toolkit).

We spent some time on Administration, such as Backup and Recovery.

We then moved on into Galera replication, and used VirtualBox images to create and manage clusters of 3 databases.

I won't reproduce a full week of rather complex training topics and details in this blog, but it was good training. I will need to go back and revisit / review this information so that it doesn't go stale on me.

Tuesday, May 16, 2017

Learning Ansible: KVM Deployment Use Case

"Pioneers get shot in the back", is what Stan Sigmund (do I have that spelled right?), the CEO of at&t used to say. Well, I don't know this firsthand. This is what some at&t employees told me once.

But it's true. It's always a lot safer to go in after the initial wave of invaders have taken all of the risk, and I think that's what Stan would have been referring to with that statement. It's about risk, which is a topic in an of itself, very blogworthy.

How does this relate to Ansible?

We have an engineer here who likes to run out in front of the curve. He did all of this research on Puppet, Chef, and Ansible, and chose Ansible. There are any number of blogs that tout the benefits of Ansible over these others, but in order to fully grasp those benefits, you need to study them all.

For me, I need to learn by doing, and then I can start to understand the benefits of one vs another.

So, I have started by taking a number of playbooks, and trying to get them working on my own system. I built a KVM host environment on a 32Gb server, and it made sense to see what I could do in terms of trying to automate the generation and spinup of these Virtual Machines.

There are a number of new things I have come across as I have been doing this:

1. Guestfish - Guestfish is a shell and command-line tool for examining and modifying virtual machine filesystems.

http://libguestfs.org/guestfish.1.html

2. getent - a small IP / host resolver that is written in Python.

https://pypi.python.org/pypi/getent

The scripts I am using are all set up to create a virtual machine using some defaults:
- default storage pool
- default network

Certainly this is easier to do than creating one-offs for every VM. But if you do this, you need to go into virt-manager and reprovision the networking and other things individually. Which kinds of defeats the purpose of using ansible in the first place (you can use a bash deploy script to generate a KVM).

So one of the things I did have to do was to hack the scripts to work with the storage pool I was using, which placed all of the images in MY directory, as opposed to where the default images were being placed.

Somehow, I need to enhance these scripts to put each VM on its own network subnet. This can all be done with virsh commands and variables, but I have not done that yet.

One problem, is that you need a MAC address to assign your adaptors if you're going to try and create those dynamically. I looked, and came across this link that can possibly serve as a weapon for doing this:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization/sect-Virtualization-Tips_and_tricks-Generating_a_new_unique_MAC_address.html

I have a handle on Ansible now; what a Playbook is, the Inventory File, what Tasks are, Roles are, Handlers, and the like. I understand all this, but can I swiftly and efficiently code all of this? No - not yet. I'm still reverse-engineer hacking from existing stuff. The background as an Integrator has honed those skills pretty well.

Ansible is as good as the underlying inputs that are fed into the process of generating outputs. It can be simple. It can be complicated. My impression is that it makes sense to crank it initially, and then enhance and hone over a period of time. Trying to everything up front and in one shot will be a huge time sink.

I'll probably write more about Ansible later. This is all for now.

Thursday, April 20, 2017

OpenDNP3: What the *&^% are all these gcda files?

I downloaded some open source, and made some customization to it. The framework was in C/C++, and architected at a level that is more complex and sophisticated than many code bases I have seen.

I noticed that when I would run the application as a power user (i.e. root), it would write out a bunch of ".gcda" files. And, one time when I ran it as a non-power user, it had trouble writing those files out, and the application produced errors (it may not have even run, I can't remember).

Well, tonight, I finally looked into the topic of what a gcda file actually is.

It is a code coverage, or code profiling tool.

You compile with a certain flag called --coverage (using gcc on Linux here), and then the GCOV framework enables these gcda files to be generated. These are binary statistics files that are updated over time as the program is run, and upon proper exit of the application.

http://bobah.net/d4d/tools/code-coverage-with-gcov

Tuesday, March 28, 2017

Deploying Etherape on a non-Development system

Go to Etherape web site. Read package dependencies.

1. Download gtk+-2.24.31 sources

1a. Run "make configure"
- Install pango-devel
- Install atk-devel
- Install gdk-pixbuf2-devel

1b.
Re-ran "make configure" and passed dependencies, then ran "make" and "make install"

2. Downloaded libglade-2.6.4 sources

2a. Ran "make configure"

- Install libgnomeui-devel

3. Downloaded Etherape 9.1.4 sources

3a. Ran "make configure"

- Install libpcap-devel

- Install gnome-doc-utils

NOTE: I got some kind of error on a documentation package, but decided it was not critical to Etherape actually working.

3b. Ran "make" and then "make install"

Thursday, March 16, 2017

NetFlow Kernel Module Programming

I have been doing some kernel module programming.This is not for kids.

Most examples on this are on kernels that pre-date the 3.10 kernels now in use (in other words, 2.6 kernels are the examples I mainly see that show how this magic is done).

But I've learned a bit from doing this. When I finally got into the more advanced kernel modules, where you need to start accessing data structures in C Programming language from the kernel headers, stuff started to not compile and I started to learn that the data structures have changed, et al.

The ultimate end to this is to write your own firewall using NetFlow. Will take some work.

But learning the NetFlow architecture, and how a packet traverses the NetFlow tables is very valuable because iptables is built on NetFlow.

I could write a lot more on this - but I'd bore you. I've compiled a lot of information and subject matter on this.

Dell PowerEdge R330 - Lifecycle and iDRAC

For the first time in years and years; maybe ever to this extent, I delved into the guts of a hardware platform; the Dell PowerEdge platform.

We order a lot of these where I work; Dell R220 (originally), Dell R230, Dell R330 and Dell R430.

Dell R430 - Carrier Grade (redundant and scalable)
Dell R330 - Enterprise Grade (has redundancy; drives, RAID card, power supplies)
Dell R230 - Commercial / Consumer Grade (weaker computing power, no redundancy)

These go up, actually, to a R7xxx series (I know someone who bought one of those - an R710), but we don't go that high where I work.

I have played with these boxes quite a bit; adding memory, auxiliary network cards, and in one case had to set a jumper to clear NVRAM on the box. One a few boxes in the earlier days, I would configure RAID on them, and partition the drives in the CentOS installer (a Kickstart process takes away that fun for us nowadays).

One thing I have done, is install iDRAC cards into boxes that were not initially ordered with iDRAC cards. I learned that if you buy the wrong ones, they might be compatible with the box, but they may not have the screw holes to mount them on the motherboard (I had to return those).

Lately, I have been playing with the iDRAC and Lifecycle Controller functions on the Dell R330. I've learned that there are numerous version of iDRAC (newer boxes happen to be running iDRAC 8 while the ones from the last couple years are on 6 and 7). Dell has documentation on these versions, which use a primitive command line (CLI) syntax that has not changed much since I originally used RACADM in the 90s.

I also played with the OS-Passthrough feature. You can direct cable with a CAT5/6 the iDRAC and a spare port on the box, and put static IPs on both of those ports and create a closed-loop out-of-band management LAN without actually cabling the box into an external network infrastructure of any kind. This allows you to VPN or tunnel into a box, and then access the local management network to get into iDRAC. You do have to cable it though - there's no way to create a virtual LAN (that I saw). You can add another IP for Lifecycle Controller if you set that statically, and have 3 IPs; one for Lifecycle Controller, one for iDRAC, and then the IP that the Operating System statically assigns when the OS comes up.

iDRAC has a web front-end that can be configured and enabled. Licensing guides on what can be done in the GUI, whereas when you use the CLI the licensing does not seem to be very informative to the user on what restrictions might be in play.

I never did get the Life Cycle Controller web interface to work, if that even exists (maybe it has a client or software that runs remotely and accesses that - looking into). So this software as it stands, appears to only work if you're on the physical console of the box and access it via the F10 key at bootup.

Trying to learn some more but at this point, this is what I have learned.

Ansible Part I

Now that I have a bunch of VMs running on a KVM host and interfacing properly with proper network configuration, the next thing that would be good to do is to learn how to deploy these VMs in an efficient way.

Right now, I have bash scripts that generate the VMs. I have one for Spice Graphics, and another without Spice Graphics for a non-graphical Minimal CentOS. Once these OS images are installed, though, I have to do considerable tweaking to get software installed and configured on them.

This is where ansible comes in.

I have a book on Ansible - and a number of Ansible scripts and playbooks.

I have not had time to read the book, nor play with the playbooks, but I did have sense enough to delete all of the inventory files. The last thing you want to do is start running playbooks and farting up someone else's virtual machines using incorrect inventory.

So that's where we are....nowhere really, except an intent to get smart about Ansible.

Ansible - is an alternative to Chef and Puppet. I know a guy who did research on all of these and chose Ansible. So that's the history on "why Ansible".

More Work on KVM - Network Configuration

Been a while since I have posted anything on here. I'll do a few updates.

One of the projects I have been working on is the transition from Virtual Box (which I run on a Windows 10 laptop) and ESXi (which we used to run on large servers) to KVM.

What I have been doing is installing an entire network on a KVM host - with different CentOS 7 virtual machines.

Initially when I did this, I put each one of these on their own subnet (default network). Then - when one of the VMs needed a static IP, I learned how to use the virsh commands to edit the xml file for the default network and insert DHCP ranges, and - of these DHCP ranges - lock a specific IP to a specific host / MAC.

What I really meant to do, was to go back and reconfigure the network to resemble the Virtual Switch mechanism that ESXi provides through the user interface. But I could not - easily - figure out how to do this.

Later, a young greenhorn developer mentioned to me that the "Connection Details" tab in the Virt-Manager GUI would allow you to add/remove and start/stop various networks. And in exploring this, I learned that you can create Routed networks, NAT networks, and custom versions of these. You can also create internal networks.

It appears that you can "enable static routes" on both the NAT and Routed networks - a little confusing but made sense once you started trying to interact between VMs. I had some issues getting NAT networks to interface with Routed Networks until I wised up and, for the VM that needed internet access, created two network interfaces on that VM; one using a NAT network (external internet) and one using Routed (for internal network that could interface with other Routed VMs).

With that I was able to create 7-8 VMs that could interface with one another, and one of those VMs could get out to the internet as required.

There might be more sophisticated things you can do, but I think if you understand the types of networks and how to properly configure them, you should pretty much be where you want to be. I might need to read up on more advanced aspects of KVM but I think I'm good for now.

Monday, February 6, 2017

Spice Graphics on KVM

I had a virtual machine I had installed with graphics=none on the virt-install script.

I then went and installed X Windows, Gnome, a browser, etc. on this VM, and then tried to start X and the darn thing would not come up. I kept getting a Connection Failed error.

I tried to run X -configure, and it came up empty.

After an hour or more of searching the web, I finally (and this took a while) found someone that was showing a screenshot of the Display on the Virtual Machine Details. This made me realize that I did not have a Display.

Eventually, I realized that I could install a Display of type qx1. But I noticed every time I tried to do this, the VM would go back to the terminal shell and an attempt to start X, or configure X, would fail again.

Finally - I came across a page on Spice drivers, which made me realize I'd installed the VM with no graphics whatsoever.

I finally installed the Spice drivers, and lo and behold, it works great now.

Here is the link I used to do this:

https://www.server-world.info/en/note?os=CentOS_7&p=kvm&f=5