I remember one time trying to debug a problem with the setup of keystone.
It took me FOREVER to figure it out.
It happened when, as part of the setup procedure, I was running this command:
su -s /bin/sh -c "keystone-manage db_sync" keystone
This command exits silently with a return code of 1 if it fails. You MUST check $? in bash to make sure the damned thing actually ran.
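If you are scripting this step, a minimal sketch of that exit-status check looks like the following (the log path is the typical CentOS/RDO location and may differ on your install):

su -s /bin/sh -c "keystone-manage db_sync" keystone
if [ $? -ne 0 ]; then
    echo "db_sync failed - check /var/log/keystone/keystone.log" >&2
    exit 1
fi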
When I saw the "1" code, I went and checked the keystone log, which said:
2017-12-31 23:28:21.807 13029 CRITICAL keystone [-] NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:mysql.pymsql
2017-12-31 23:28:21.807 13029 ERROR keystone Traceback (most recent call last):
2017-12-31 23:28:21.807 13029 ERROR keystone File "/bin/keystone-manage", line 10, in <module>
2017-12-31 23:28:21.807 13029 ERROR keystone sys.exit(main())
2017-12-31 23:28:21.807 13029 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/manage.py", line 44, in main
2017-12-31 23:28:21.807 13029 ERROR keystone cli.main(argv=sys.argv, config_files=config_files)
2017-12-31 23:28:21.807 13029 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/cli.py", line 1312, in main
...
I started looking at all of the packages I'd installed, checking them (they were all there). I then went in search of help on Google. And yes, the error message was out there, but there was no help on fixing it.
I then realized...the connection URL in keystone.conf was wrong. The problem is that the naked eye can't easily spot the error:
Incorrect URL:
connection = mysql+pymsql://keystone:KEYSTONE_DBPASS@controller/keystone
Correct URL:
connection = mysql+pymysql://keystone:KEYSTONE_DBPASS@controller/keystone
Optically, it is very hard to see the missing second "y" - "pymsql" instead of "pymysql" - because of the adjacent "py". Fortunately, I lost only 30 minutes on this issue this time. Earlier, when installing Newton, I lost an entire day or more.
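One way to catch this class of typo without relying on the naked eye is to grep the config file; a quick sketch, assuming the stock /etc/keystone/keystone.conf location:

grep -n "^connection" /etc/keystone/keystone.conf
grep -c "pymysql" /etc/keystone/keystone.conf    # a count of 0 means the dialect name is misspelled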
Sunday, December 31, 2017
Tuesday, December 19, 2017
Port Binding Failures on OpenStack - How I fixed this
In trying to set up OpenStack for a colleague, I had an issue where I could not get the ports to come up. The port status for the networks would be in a state of "down", and I could not ping these ports from within the network namespaces, or from the router ports.
Unfortunately, I did not have DEBUG turned on in the logs. So I saw no output from nova or neutron about any kind of issues with networking or port bindings.
I did enable DEBUG, at which point I started to see "DEBUG" messages (not "ERROR" messages mind you, but "DEBUG" messages) about port binding failures. Finding these port binding debug messages was like looking for a needle in a haystack as there is a ton of debug output when you enable debug in Nova and Neutron.
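If you are hunting for the same thing, grepping for the binding messages shrinks the haystack considerably. A sketch, assuming the default RDO log locations - the exact message wording varies by release, so treat the patterns as approximations:

grep -iE "failed to bind|binding.failed" /var/log/neutron/server.log
grep -i "binding" /var/log/nova/nova-compute.log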
I had a very very difficult time figuring out what was causing this. But here is how I fixed it:
1. I watched a segment on YouTube about ML2 and Neutron. It was an OpenStack Summit session, and the URL is here:
https://www.youtube.com/watch?v=e38XM-QaA5Q
2. I quickly realized that host names are such an integral part of port binding that it was necessary to check the agents, the host names of those agents in Neutron, and the host names stored in MySQL.
In MySQL, the neutron database has a table called agents, and every agent is mapped to a host. That host needs to be correct and resolvable.
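A quick way to cross-check all three places a hostname can live; this is a sketch that assumes the Newton-era neutron CLI, a root MySQL login, and "controller" as the hostname (taken from my setup):

neutron agent-list                    # agent type, host, and alive status as Neutron sees it
mysql -u root -p neutron -e "SELECT id, agent_type, binary, host FROM agents;"
getent hosts controller               # verify that each host listed above actually resolves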
In the end, I wound up deleting some old agents that were no longer being used (old hosts, and some openvswitch agents lingering from a previous switch from linuxbridge to openvswitch). I then had to correct some hostnames, because my OpenStack Controller and Network node live in a VM that I had recycled for my colleague - who had given the VM a new hostname on his platform.
Then, just to be thorough, I deleted all agents (e.g. DHCP agents), then all subnets, then all networks. I then re-created them - WITH NEW NAMES (to ensure that OpenStack wasn't re-using old ones) - in order: first the networks, then the subnets, then the agents (which generally create their ports themselves). Lastly, I attached the new subnets to the router as interfaces (which creates ports).
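The CLI steps look roughly like the following; the names (selfservice-v2, selfservice-subnet-v2, router1) and the CIDR are made up for illustration, and the syntax assumes a reasonably recent python-openstackclient:

openstack network create selfservice-v2
openstack subnet create --network selfservice-v2 --subnet-range 172.16.2.0/24 selfservice-subnet-v2
openstack router add subnet router1 selfservice-subnet-v2    # creates the router port on the new subnet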
One thing that is EXTREMELY important is that ports bind to the PHYSICAL network...not the virtual network.
If you create an external provider network called "provider", while the physical network is called "physical", and you then go into ml2_conf.ini and linuxbridge.ini and use "provider" instead of "physical" in your bindings, you will most assuredly end up with a port binding failure.
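In other words, the physical network label has to line up in all three places. A sketch using the "physical" label from the example above - the interface name eth1, the flat network type, and the exact file paths are assumptions for illustration:

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2_type_flat]
flat_networks = physical

# /etc/neutron/plugins/ml2/linuxbridge_agent.ini
[linux_bridge]
physical_interface_mappings = physical:eth1

# and the network itself must reference the same label
openstack network create --external --provider-network-type flat --provider-physical-network physical provider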
So these are the tips and tricks to solving the port binding issue, or configuring properly ahead of time so that you don't run into the port binding issue.
Wednesday, December 13, 2017
OpenStack Compute Node - state path and lock file directories
I may have posted on this topic before, because I ran into this issue before.
I was setting up OpenStack for a colleague of mine, and had all sorts of issues getting it to work.
A couple of problems were related to services that had not been enabled, so when the unit(s) rebooted, those services did not start up. These were easy to fix.
The difficult issue to find and fix - which took me almost a full business day - had to do with how OpenStack Nova configures itself.
In the /etc/nova.conf file, there is a variable called state_path. It is set to /var/lib/nova - a directory Nova creates upon installation, owned by the nova user and group.
In this directory is a subdirectory called "instances", where Nova puts running instances.
The problem is that Nova, on installation, does not check or care about partition and file system sizes. It just assumes there is enough space.
The issue we had was that on a default CentOS 7 installation, the /var directory is part of the root file system, which is very small (15-20 GB), as it normally should be (you generally separate the root file system from apps and data).
When you started Nova, even in debug mode, you never saw an ERROR indicating that Nova had problems with any of its scheduler filters (disk, RAM, compute, et al.). They were written to the log as DEBUG and WARNING messages, which made finding the problem like finding a needle in a haystack - and you only saw this evidence after enabling debug in the /etc/nova.conf file.
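If you are chasing the same problem, turning on debug and grepping the scheduler log on the Controller surfaces it quickly. A sketch, assuming default RDO log locations and the approximate message wording of that era:

# in nova.conf on the Controller
[DEFAULT]
debug = true

# then look for the filter decisions
grep -iE "DiskFilter|returned 0 hosts" /var/log/nova/nova-scheduler.log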
Eventually, after enabling debug and combing through the logs (on both the Controller and the Compute node), we found a message on the Controller node (NOT the Compute node, where you would expect it to be) about the disk filter returning 0/1 hosts.
So we moved /var/lib/nova to /home/nova (which had hundreds of GB free). We also changed the nova user's home directory in /etc/passwd from /var/lib/nova to /home/nova.
We got further...but it was STILL FAILING.
Further debugging indicated that when we moved the directory, we had forgotten about another variable in /etc/nova.conf that sets the lock file path (lock_path). This variable, used for Nova's inter-process lock files, was still pointing to a lock directory under /var/lib (which had been moved to /home). This caused Compute Filter issues - again showing up as DEBUG and WARNING messages rather than errors.
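For reference, here is a sketch of our end state. The [oslo_concurrency]/lock_path option name reflects recent releases (older configs may keep the lock path in [DEFAULT]), so verify the section against your own nova.conf:

[DEFAULT]
state_path = /home/nova          # default is /var/lib/nova; must sit on a partition with room for instances

[oslo_concurrency]
lock_path = /home/nova/tmp       # move this along with state_path, or the scheduler filters quietly fail

# and update the nova account's home directory to match
usermod -d /home/nova nova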
Ubiquiti EdgeRouter X - Power Over Ethernet (POE)
I tested the POE feature of the router this week. I will cover what I did, and the results.
I had a Polycom VOIP phone that was POE-enabled, but I did not have the power supply for it; hence the use case - and my excitement about leveraging this feature when I started using this router.
By default, this router uses eth0 as the Management port (192.168.1.0/24), and eth1 as the WAN / Internet port.
This assumes that only a single ISP is being used with the router, which is a good assumption considering this is a small consumer appliance and not an enterprise router. The router actually supports dual ISPs (link aggregation), which is certainly an enterprise feature, but to employ it you need to change the ports so that you are using eth0 (leftmost port) and eth4 (rightmost port).
The POE feature on this router is POE passthrough, which means that POE power must be supplied on the "input port" - which happens to be eth0 (leftmost port) - and POE output is then supplied on the passthrough port, eth4.
- First, I checked the line to make sure it had POE by connecting it to the phone directly. Voilà, it powered up. So, knowing POE was indeed live on the incoming CAT-5 from the patch panel, I set about trying it on the router.
- Next, I had to reconfigure the router so that eth0 was the WAN port. This meant removing (or moving) the Management port. Since a Management port is handy to have, I made eth1 the Management Port - essentially switching eth0 and eth1. I will skip the details on how to do this, but basically you can do this through the menu system of the device by going to Dashboard, Services and Firewall.
- I then connected the POE VOIP phone to eth4 - directly. A common mistake is to use a switch, but if the switch is not a POE switch, this won't work (you can also burn up the switch this way). I made sure that eth4 was on the Switch! For some reason, only eth2 and eth3 were on the switch (boxes ticked). I ticked the eth4 box.
- POE is enabled on the eth4 POE output interface - I was surprised you could not enable it on the eth0 input link. I enabled it there. Keep in mind the router was STILL PLUGGED IN!
I did not have more time to debug this. I disabled POE on eth4, and used a POE Adaptor, which works just fine. Maybe I can attempt this later, but at first test, POE did work. For me. For my phone. This is the first time I have actually used POE.
Thursday, December 7, 2017
OpenVSwitch Round III
Turns out that when I rebooted my boxes, nothing worked. I debugged the issue back to OpenVSwitch.
The problem is definitely the provider bridge on the Controller / Network node.
I *think* the problem is that I am using only a single interface on the Virtual Machine, and using that same interface for two purposes:
1. VXLAN tunneling to OpenStack virtual machines (via the br-tun OpenVSwitch-managed bridge)
2. Internet connectivity (via the br-provider OpenVSwitch-managed bridge)
I have to verify this, but that is the hunch.
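To verify the hunch, OpenVSwitch itself can show which interface each bridge owns; a quick sketch with standard ovs-vsctl commands (the bridge names are the ones from this setup):

ovs-vsctl show                      # all bridges, their ports, and any vxlan tunnel endpoints
ovs-vsctl list-ports br-provider    # should contain the physical/uplink interface
ovs-vsctl list-ports br-tun         # should contain only patch and vxlan ports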
I have backed out OpenVSwitch once again, reverted back to linuxbridge-agent, and more research to follow.
NOTE: To use two interfaces on the VM means quite a bit of work. The interface ens3 on the VM connects to a bridge that connects to an adaptor on the host. I probably need to use two adaptors on the host, and I probably need additional network(s) created, which has routing implications.
Essentially it's a network redesign - or, we could say, expansion.
What to do if your VM won't boot
After I made a bunch of changes to OpenVSwitch (see last post), I rebooted my VM and it just hung at a "random: crng init done" message. The last thing it did before that message appeared was bring up the network interface ens3. I assumed it was trying to start networking and was waiting for a timeout.
I waited for what seemed like an eternity...5 minutes? 10? Longer? And a prompt came up. I saw little to nothing in the logs, so I became concerned. How would I debug this if I have to wait that long every time I reboot?
I found a few nifty tricks to get you out of a jam if your virtual machine won't boot.
First, if the VM will come up at all, you can enable the systemd debug shell with systemctl (the unit is debug-shell.service):
# systemctl enable debug-shell.service
This service gives you a root shell on tty9 as the VM is booting. On a virtual machine, you typically need to "send" keystrokes to reach that terminal. On KVM (virt-manager), there is a "Send Key" option that lets you send key combinations to the VM (i.e. Ctrl-Alt-Backspace, Ctrl-Alt-F1...F9, et al.). This proved to be quite handy, and it was a relief to know I could actually get inside the VM - especially if you have not snapshotted or backed it up.
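For KVM guests managed with libvirt, the same keystrokes can also be sent from the host with virsh; a sketch, where "controller-vm" is a made-up domain name:

virsh send-key controller-vm KEY_LEFTCTRL KEY_LEFTALT KEY_F9    # switches the guest console to tty9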
If you need to do more, or go further, you may need to interrupt the boot process and edit the kernel line from the GRUB boot menu. For Red Hat users, here is a link to a page that gives some direction on how to get into debug mode for a virtual machine.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-terminal_menu_editing_during_boot
Tuesday, December 5, 2017
OpenStack and OpenVSwitch - Round II
Yesterday I installed OpenVSwitch in conjunction with OpenStack and was excited to see that it worked - first time - after I installed and configured it.
To review, there are two Compute nodes. These are NOT virtual machines; they are physical KVM / libvirtd hosts (each KVM host is a Compute node). There is also one OpenStack Controller that IS a virtual machine, and this virtual machine also serves as the OpenStack Network node (it is common to see the Controller and Network node separated, but I run them on the same VM).
The two OpenStack Compute Nodes appear to be working perfectly. Each of these has two bridges:
1. br-tun - used for tunneling to the controller
2. br-int - used as an integration hub (virtual patch cords between bridges)
The Controller node, however, is a bit different in that it needs to have an additional bridge called br-provider. This bridge connects out to the internet.
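For context, the provider bridge on the Network node is wired up with a couple of ovs-vsctl commands plus a bridge mapping in the OVS agent config. A sketch, assuming ens3 is the uplink interface (as it is on my Controller VM) and the stock openvswitch_agent.ini path:

ovs-vsctl add-br br-provider
ovs-vsctl add-port br-provider ens3

# /etc/neutron/plugins/ml2/openvswitch_agent.ini
[ovs]
bridge_mappings = provider:br-provider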
I am taking this architecture STRAIGHT out of the OpenStack documentation (I am running Newton but these diagrams are also identical and present in the Ocata release of OpenStack), which is at this website: https://docs.openstack.org/newton/networking-guide/deploy-ovs-selfservice.html
Having it this way can be difficult in some ways. For instance, I have to run an OpenVSwitch agent on this Network node - and this is not well documented (the documentation leads you to believe you only need the linuxbridge agent or openvswitch agent on the Compute nodes). Another issue is that there are ports (taps) for every DHCP agent and every instance, which gets confusing when you look at your OpenVSwitch and see all of them with cryptic names beginning in qr, qg, tap, etc. You don't always know which is which or where to start debugging.
Anyway...I had a VERY difficult time getting OpenVSwitch to work. The first issue I had was that the bridges in OpenVSwitch did not seem to have the right interfaces. This was a configuration issue. Another issue had to do with the br-provider bridge: I actually thought I should put one on the Compute nodes, even though the diagram did not show that. I didn't understand why the Compute nodes could not go straight out - and why their traffic needed to be tunneled over to the Controller / Network node, which seemed like a huge, unnecessary performance hit.
But the MAIN issue I had was that once configured and after I instantiated virtual machine instances, they would come up but without an IP Address. I set the instantiated VM's IP Address statically, and lo and behold, it could communicate just fine. This told me that the issue was DHCP; that the VMs were not getting an IP Address via DHCP. At least I knew, but this concerned me because it did not seem like it would be easy to debug or fix (it wasn't).
2. I checked the dhcp agent services. They were running. I checked the logs. They looked okay. I also did the same for the l3 agent.
3. I then mentioned this to my boss, and he suggested that the OpenVSwitch was not handling broadcast messages. He suggested I trace the broadcast by checking the endpoints and working in from there.
This led me into the namespaces with "ip netns exec". What I found was that the interfaces inside the namespaces had "@" suffixes in their names (as shown by ip link), which made it tricky to run tcpdump against them.
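For anyone tracing the same thing: the text after the "@" in ip link output is the peer/parent interface, not part of the device name, so strip it before handing the name to tcpdump. A sketch with made-up namespace and interface IDs:

ip netns                                        # lists the qdhcp-<net-id> and qrouter-<router-id> namespaces
ip netns exec qdhcp-<net-id> ip -o link         # shows names like "tap1234abcd-56@if12"
ip netns exec qdhcp-<net-id> tcpdump -ni tap1234abcd-56 port 67 or port 68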
I finally ran across the ovs-testing package, which has some utilities such as ovs-tcpdump that help debug bridges and interfaces inside the switch. When I did this, I found the broadcast traffic reaching the OpenStack Controller.
Finally - I realized that I needed to DELETE the dhcp agents and RECREATE them. The original agents were created when the networks were created, using the linuxbridge. Apparently switching to OpenVSwitch rendered these agents inoperable. They could not "adapt" to that change. So - I deleted those dhcp agents, OpenStack recreated them, and now things are working just fine.
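The cleanup itself was just a handful of commands; a sketch using the Newton-era neutron CLI, with the agent IDs coming from the list output:

neutron agent-list | grep -i dhcp       # note the IDs of the stale DHCP agents
neutron agent-delete <agent-id>         # repeat for each stale agent
systemctl restart neutron-dhcp-agent    # the running agent re-registers itself with a fresh record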