Thursday, February 8, 2018

Object Oriented Python - My Initial Opinion and Thoughts

For the last couple of weeks, I have not been involved so much in virtualized networking as I have been writing a REST API client in Python.

I have a pretty hardcore development background, ranging from various assembler languages to C and C++, and later on, Java (including Swing GUI, EJBs, and Application Servers like BEA WebLogic and IBM Websphere).

Then - just like that - I stopped developing. For a number of reasons.

Actually, in getting out of the game, I found myself getting into Systems Integration, scope work (writing contracts and proposals), and so forth - things that never really took me all that far from the original programming.

In getting back into research, the ability to use a vi editor, type 100 wpm, and code quickly lends itself very well to building research prototypes.

So with Orchestration and NFV, it became apparent I needed to write a REST client so that I could "do things" (i.e. provision, deprovision, MACD - an ops term for Move Add Change Delete).

I wound up choosing Python. And, thanks to quick fingers and google, I was able to crank some code very quickly.

Some things I did on this project include:

Object-Oriented classes with TRUE practical inheritance and encapsulation. Practical in the sense that it makes sense and is not done simply for the sake of exercising the language feature. For example, I created a class called Service, extended that into L2 and L3 services, and then from there extended it further into use cases for an SDWAN (virtual elements that are L2 or L3 devices like bridges, gateways and switches).
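A rough sketch of the idea looks something like this (class names and fields here are illustrative, not the actual project code):

class Service(object):
    """Base class for anything the client can provision."""
    def __init__(self, name):
        self.name = name

    def to_dict(self):
        # Subclasses add their own fields on top of the common ones.
        return {"name": self.name, "type": type(self).__name__}

class L2Service(Service):
    def __init__(self, name, vlan_id):
        super(L2Service, self).__init__(name)
        self.vlan_id = vlan_id

    def to_dict(self):
        d = super(L2Service, self).to_dict()
        d["vlan_id"] = self.vlan_id
        return d

class SDWANBridge(L2Service):
    """An SDWAN use case built on top of the generic L2 service."""
    def __init__(self, name, vlan_id, site):
        super(SDWANBridge, self).__init__(name, vlan_id)
        self.site = site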

I also exercised the JSON capabilities. Python seems built from the ground up for REST APIs, in the sense that it is extremely easy to render objects into dictionaries that in turn become JSON that can be sent up to a REST server.
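Something along these lines (the endpoint URL and payload are made up for illustration, and I'm assuming the widely used requests library here, not necessarily what the project used):

import json
import requests   # third-party HTTP library; an assumption on my part

class Service(object):
    def __init__(self, name, service_type):
        self.name = name
        self.service_type = service_type

svc = Service("customer-42-vpn", "l3")
payload = json.dumps(vars(svc))    # object -> dictionary -> JSON string
resp = requests.post("https://orchestrator.example.com/api/v1/services",
                     data=payload,
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)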

The last feature I used was argparse. Initially, I was calling main with a long list of raw arguments. That was not feasible, so I implemented getopt-style option parsing (the optarg mechanism people in Unix/Linux/POSIX land are familiar with). I then learned about argparse and refactored the code to use this rather cool way of passing arguments into the main function of a Python module, so that it can be invoked standalone from the command line.
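A minimal sketch of that pattern (the option names are illustrative):

import argparse

def main():
    parser = argparse.ArgumentParser(description="REST client for provisioning services")
    parser.add_argument("--action", required=True,
                        choices=["provision", "deprovision", "modify"],
                        help="operation to perform")
    parser.add_argument("--service", help="name of the service to operate on")
    parser.add_argument("--verbose", action="store_true", help="chatty output")
    args = parser.parse_args()
    print(args.action, args.service, args.verbose)

if __name__ == "__main__":
    main()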

There are some things to get used to in Python. For sure. It is very, very loose. No statement terminators. No static typing. Yet it is picky about how you indent your code (because you don't have the braces and parentheses). I think it's okay for what it is designed for. For me the struggle was getting used to the lack of precision in it. Also, the fact that it barks at runtime rather than compile time for a lot of things is, in my opinion, quite bad for something you may shove out onto some kind of production server. I'll probably always lean on C / C++, but I do think Python has a place in this world.
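A trivial example of what I mean by "barks at runtime": the typo below sails straight through any syntax check and only blows up when the function is actually called (the names are obviously hypothetical).

def deprovision(service):
    # "serivce" is a typo; Python raises a NameError only when this line executes
    return serivce.teardown()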

Friday, January 19, 2018

OpenStack Networking - More Learnings

For a while now, OpenStack has been working just fine for the most part, except for some of the issues trying to switch back and forth between OpenVSwitch and LinuxBridge agents (earlier posts discuss some of these issues).

This week, I ran into some issues trying to use OpenStack with virtual machines that use MULTIPLE interfaces, as opposed to just a single eth0 interface.

This post will discuss some of those issues.

The first issue I ran into happened when I added a second and third router to my existing network.

The diagram below depicts the architecture I had designed. The router on the top and bottom were the new routers, and the orange, red and brown networks were newly added networks.
Adding second and third routers to an OpenStack network



The first issue was that, IMMEDIATELY, DHCP seemed to stop working on any instantiated networks. Stranger still, the IP assignment behavior became very odd. The VMs might (or might not) get an IP at all, or they might get an IP on their own or on each other's network segments. But the IP was NEVER in the DHCP-defined range.

I always use a range of .11 to .199 for all /24 networks, just so I can tell at a glance whether an address was actually handed out by DHCP. This paid off, because even when the LAN segment was correct, I found that VMs were getting a .3 address.
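For reference, this is roughly how that range gets defined when creating a subnet (the network/subnet names and CIDR are placeholders; the legacy neutron CLI has an equivalent):

openstack subnet create --network red-net \
    --subnet-range 192.168.30.0/24 \
    --allocation-pool start=192.168.30.11,end=192.168.30.199 \
    --dhcp red-subnet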

So I knew DHCP was not working. I finally realized that the ".3" address the VMs were getting was a Floating IP that OpenStack hands an instance when DHCP is not working - sort of like a default.

So why was DHCP not working?

The answer - after considerable debugging - turned out to be the "agent" function in OpenStack.

If you are running LinuxBridgeAgent, you should see something that looks like this when you run the "neutron agent list" command:

Proper listing of Alive and State values on a Neutron Agent Listing in OpenStack
Keep in mind that there are two fields here to consider: the "Alive" field and the "State" field.

Somehow, because I had at one time been running OpenVSwitch on this controller, I had entries in this list for OpenVSwitch agents. And I found that while the Alive value for those was DOWN, the STATE field was set to UP!!!!

Meanwhile, the LinuxBridge agents were Alive, but their State field was set to DOWN!!!

These were reversed! No wonder nothing was working. I figured this out after tracing ping traffic through the tap interfaces, only to find that the pings were disappearing at the router.

I wound up doing two things to fix this:
a. I deleted the OpenVSwitch entries altogether using the "neutron agent delete" command.
b. I changed the state to "UP" for the LinuxBridge agents by using the "neutron agent set" command - which produces the correct listing shown above. (Both commands are sketched below.)
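Roughly like this, in the openstack CLI form (the agent IDs are placeholders; the legacy neutron CLI has equivalent agent-list / agent-delete / agent-update commands):

openstack network agent list
openstack network agent delete <ovs-agent-id>
openstack network agent set --enable <linuxbridge-agent-id>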

All of a sudden, things started working. VMs got IP addresses, and got the CORRECT DHCP addresses because the DHCP requests were getting all of the way through and the responses were coming back.

Now....the second issue.

The internal network was created as a LOCAL network. It turns out that DHCP does not seem to work on networks of this type in OpenStack. If you do define DHCP on the LOCAL network, OpenStack will reserve an IP for the instance, but the VM does not seem to actually get that IP, because the DHCP requests and offers don't seem to flow.

Keep in mind (see diagram above) that the local network is shared, but it is NOT directly connected to a router. This COULD be one reason why it is not getting a DHCP address. So you can just set one manually, right? WRONG!!!

I set up a couple of ports on this network, with FIXED IPs. And if the VMs are provisioned using these FIXED IPs - you cannot just arbitrarily add some other IP on the same segment to that interface and expect traffic to work. It won't. In other words, the IP needs to be the one OpenStack expects it to be. Period.

The same applies to DHCP. Even though DHCP does not seem to work on the LOCAL isolated network, if you do define a DHCP range on that network, OpenStack will still reserve an address for the VM. But the VM may not actually get that address. You can assign one yourself (i.e. manually, using the iproute2 facility), but if you do not choose the same one that OpenStack has reserved, traffic will not work. So the VM somehow needs to KNOW what address has been reserved for it - and that's not easy to do.

Therefore, the solution seems to be to use a Port. That way the instantiator chooses the IP, and can set it for the VM if it does not get set through OpenStack (which I believe uses CloudInit to set that IP - though I should double check this to be sure).
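The workflow is roughly this (names, image, flavor and addresses are placeholders):

openstack port create --network local-net \
    --fixed-ip subnet=local-subnet,ip-address=192.168.50.20 vm1-port
openstack server create --image cirros --flavor m1.small \
    --port vm1-port vm1

Inside the guest you may still have to configure 192.168.50.20 by hand (iproute2, cloud-init, whatever), but at least you are the one who chose it, so you know what OpenStack expects.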

Sunday, December 31, 2017

Keystone Identity Service on OpenStack: Database url issue

I remember one time trying to debug a problem with the setup of keystone.

It took me FOREVER to figure it out.

It was when I was, as part of set up procedure, running the script:
su -s /bin/sh -c "keystone-manage db_sync" keystone

This script will exit silently with a "1" code if it does not run. You MUST check the $? in bash to make sure the damned thing ran.
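In other words, something like this in the setup script (assuming the default log location):

su -s /bin/sh -c "keystone-manage db_sync" keystone
if [ $? -ne 0 ]; then
    echo "keystone db_sync failed - check /var/log/keystone/keystone.log"
    exit 1
fi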

When I saw the "1" code, I went and checked the keystone log, which said:

2017-12-31 23:28:21.807 13029 CRITICAL keystone [-] NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:mysql.pymsql
2017-12-31 23:28:21.807 13029 ERROR keystone Traceback (most recent call last):
2017-12-31 23:28:21.807 13029 ERROR keystone   File "/bin/keystone-manage", line 10, in <module>
2017-12-31 23:28:21.807 13029 ERROR keystone     sys.exit(main())
2017-12-31 23:28:21.807 13029 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/cmd/manage.py", line 44, in main
2017-12-31 23:28:21.807 13029 ERROR keystone     cli.main(argv=sys.argv, config_files=config_files)
2017-12-31 23:28:21.807 13029 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/cmd/cli.py", line 1312, in main
...

I started looking at all of these packages I'd installed, checking them (they were all there). I then went in search of help on google. And yes, the message was there, but no help fixing it.

I then realized... the URL in the config file was wrong. The problem is, the naked eye can't easily spot the error:

Incorrect URL:
connection = mysql+pymsql://keystone:KEYSTONE_DBPASS@controller/keystone

Correct URL:
connection = mysql+pymysql://keystone:KEYSTONE_DBPASS@controller/keystone

Optically, it is very hard to see the missing "y" - "pymsql" instead of "pymysql" - because your eye reads straight past the "py". Fortunately, I lost only 30 minutes on this issue this time. Earlier, when installing Newton, I lost an entire day or more.

Tuesday, December 19, 2017

Port Binding Failures on OpenStack - How I fixed this


In trying to set up OpenStack for a colleague, I had an issue where I could not get the ports to come up. The port status for the networks would be in a state of "down", and I could not ping these ports from within the network namespaces, or from the router ports.

Unfortunately, I did not have DEBUG turned on in the logs. So I saw no output from nova or neutron about any kind of issues with networking or port bindings.

I did enable DEBUG, at which point I started to see "DEBUG" messages (not "ERROR" messages mind you, but "DEBUG" messages) about port binding failures. Finding these port binding debug messages was like looking for a needle in a haystack as there is a ton of debug output when you enable debug in Nova and Neutron.

I had a very very difficult time figuring out what was causing this.  But here is how I fixed it:

1. I watched a segment on YouTube about ml2 and Neutron. This was from an OpenStack Summit session, and the url is here:

https://www.youtube.com/watch?v=e38XM-QaA5Q

2. I quickly realized that host names were such an integral part of port binding that it was necessary to check the agents, the host names of those agents in Neutron, and the host names stored in MySQL.

In MySQL, the neutron database has a table called agents, and all agents are mapped to a host.  That host needs to be correct and resolvable.
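A quick way to eyeball this is straight in the database (column names are as I remember the neutron schema, so verify against your release):

mysql -u neutron -p neutron -e \
    "SELECT id, agent_type, host, admin_state_up FROM agents;"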

In the end, I wound up deleting some old agents that were no longer being used (old hosts, and some openvswitch agents that were lingering from a previous switch from linuxbridge to openvswitch).  I then had to correct some hostnames because I had my OpenStack Controller and Network node in a VM and I had recycled that VM for my colleague - who had reassigned a new hostname for the VM on his platform.


Then, just to be thorough, I deleted all agents (i.e. DHCP agents), then deleted all subnets, then deleted all networks. I then re-created these - WITH NEW NAMES (so as to ensure that OpenStack wasn't re-using old ones) - in order: first the networks, then the subnets, then the agents (which generally create their ports themselves). Lastly, I mapped these new subnets to the router as interfaces (which creates ports).
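The re-creation sequence looked roughly like this (names, CIDR and router are placeholders):

openstack network create inside-net-v2
openstack subnet create --network inside-net-v2 \
    --subnet-range 192.168.40.0/24 \
    --allocation-pool start=192.168.40.11,end=192.168.40.199 inside-subnet-v2
openstack router add subnet router1 inside-subnet-v2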

One thing that is EXTREMELY important, is that ports bind to the PHYSICAL network...not the Virtual network. 

If you create an external provider network called "provider", and the physical network is called "physical", and you then go into ml2_conf.ini and linuxbridge.ini and use "provider" instead of "physical" in your bindings, you will most assuredly end up with a port binding failure.
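In other words, the physical network label has to line up across the ml2 config, the agent config, and the network you create. Something like this (the interface name eth1 is just an example, and on my installs the agent file is called linuxbridge_agent.ini):

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2_type_flat]
flat_networks = physical

# /etc/neutron/plugins/ml2/linuxbridge_agent.ini
[linux_bridge]
physical_interface_mappings = physical:eth1

# and the network itself
openstack network create --external --provider-network-type flat \
    --provider-physical-network physical provider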

So those are the tips and tricks for solving the port binding issue - or for configuring things properly ahead of time so that you don't run into it in the first place.

Wednesday, December 13, 2017

OpenStack Compute Node - state path and lock file directories

I may have posted on this topic before, because I ran into this issue before.

I was setting up OpenStack for a colleague of mine, and had all sorts of issues getting it to work.

A couple of problems I had were related to services that were not enabled, so when the unit(s) rebooted, those services did not start up. These were easy to fix.

The difficult issue to find and fix - which took me almost a full business day - had to do with how OpenStack Nova configures itself.

In the /etc/nova/nova.conf file, there is a variable called state_path. This variable is set to /var/lib/nova - a directory nova creates upon installation, with ownership set to the nova user and group.

In this directory is a subdirectory called "instances", where Nova puts running instances.

The problem is that Nova, on installation, does not seem to check or care about partition and file system sizes. It just assumes there is enough room.

The issue we had was that on a default CentOS 7 installation, the /var directory is part of the root file system, which is quite small (15-20 GB), as it normally should be (you generally separate the root file system from apps and data).

When you started Nova, even in debug mode, you never saw an ERROR about the fact that Nova had issues with any of its scheduler filters (disk, RAM, compute, et al). They were being written into the log as DEBUG and WARNING. This made finding the problem like finding a needle in a haystack. And you only saw this evidence after enabling debug in the /etc/nova/nova.conf file.

Eventually, after enabling debug and combing through the logs (on both Controller as well as Compute node), we found a message on the Controller node (NOT THE COMPUTE NODE WHERE YOU WOULD EXPECT IT TO BE) about the disk filter returning 0/1 hosts.

So - we moved the /var/lib/nova to the /home/nova directory (which had hundreds of Gb). We also changed the home directory of nova in /etc/passwd to /home/nova (from /var/lib/nova).

We got further...but it was STILL FAILING.

Further debugging indicated that when we moved the directory, we had forgotten about another variable in /etc/nova/nova.conf: the lock file path (the lock_path option, which lives under [oslo_concurrency] on the releases I have used). Used by Nova for its lock files, this variable was still pointing to a directory under /var/lib (which had been moved to /home). This caused Compute Filter issues - also showing up as DEBUG and WARNING messages and not errors.
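For the record, the two settings that bit us look something like this (the exact paths are our values, shown here purely as an illustration):

# /etc/nova/nova.conf
[DEFAULT]
state_path = /home/nova

[oslo_concurrency]
lock_path = /home/nova/tmp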

Ubiquiti EdgeRouter X - Power Over Ethernet (POE)

I tested the POE feature of the router this week. I will cover what I did, and the results.

I had a Polycom VOIP phone that was POE-enabled, and I did not have the power supply for it; hence the use case, and excitement about leveraging this feature when I started using this router.

By default, this router uses eth0 as the Management port (192.168.1.0/24), and eth1 as the WAN / Internet port.

This assumes that only a single ISP is being used with the router, which is a reasonable assumption considering this is a small consumer appliance and not an Enterprise router. The router actually supports dual ISPs (WAN load balancing / failover), which is certainly an Enterprise feature, but to employ this feature you need to change the ports such that you are using eth0 (leftmost port) and eth4 (rightmost port).

The POE feature on this router is POE passthrough, which means that you feed POE power in on the "input" port - which happens to be eth0 (leftmost port) - and POE output is supplied on the passthrough port, eth4.


  1. First, I checked the line to make sure it had POE by connecting it to the phone directly. Voila, it powered up. So, knowing POE was indeed present on the incoming CAT-5 from the patch panel, I set about trying it on the router.
  2. Next, I had to reconfigure the router so that eth0 was the WAN port. This meant removing (or moving) the Management port. Since a Management port is handy to have, I made eth1 the Management Port - essentially switching eth0 and eth1. I will skip the details on how to do this, but basically you can do this through the menu system of the device by going to Dashboard, Services and Firewall.
  3. I then connected the POE VOIP phone to eth4 - directly. A common mistake is to use a switch, but if the switch is not a POE switch, this won't work (you can also burn up the switch this way). I made sure that eth4 was on the Switch! For some reason, only eth2 and eth3 were on the switch (boxes ticked). I ticked the eth4 box.
  4. POE is enabled on the eth4 output interface - I was surprised you could not enable it on the eth0 input link. I enabled it there. Keep in mind the router is STILL PLUGGED IN!
The POE VOIP phone did not power up.  One theory is that the router could not power up the phone on its own, based on just the voltage from the input CAT-5 (keep in mind it also needs to power the router itself). We then powered up the router with the adaptor, thinking this might supply more juice (voltage) to power up the phone. Nope.

I did not have more time to debug this. I disabled POE on eth4 and used a POE adaptor, which works just fine. Maybe I can attempt this again later, but at first test, POE passthrough did NOT work. For me. For my phone. This is the first time I have actually tried to use POE.



Thursday, December 7, 2017

OpenVSwitch Round III

Turns out that when I rebooted my boxes, nothing worked. I debugged the issue back to OpenVSwitch.

The problem is definitely the provider bridge on the Controller / Network node.

I *think* the problem is that I am using only a single interface on the Virtual Machine, and using that same interface for two purposes:

1. vxlan tunneling to OpenStack virtual machines (via the br-tun OpenVSwitch managed bridge.)
2. internet connectivity (via the br-provider OpenVSwitch managed bridge)

I have to verify this, but that is the hunch.
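For context, the single-interface setup on the controller/network node looks roughly like this (ens3 is the VM's interface as mentioned below; the config keys are from the OVS agent file as I recall them, so treat this as a sketch):

# the one interface hangs off the provider bridge
ovs-vsctl add-br br-provider
ovs-vsctl add-port br-provider ens3

# /etc/neutron/plugins/ml2/openvswitch_agent.ini
[ovs]
bridge_mappings = provider:br-provider
local_ip = <ens3's IP address>   # the same interface also carries the vxlan tunnels via br-tun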

I have backed out OpenVSwitch once again and reverted back to the linuxbridge agent; more research to follow.

NOTE: To use two interfaces on the VM means quite a bit of work. The interface ens3 on the VM connects to a bridge that connects to an adaptor on the host. I probably need to use two adaptors on the host, and I probably need additional network(s) created, which has routing implications.

Essentially it's a network redesign - or, we could say, expansion.

