Tuesday, December 5, 2017

OpenStack and OpenVSwitch - Round II


Yesterday I installed OpenVSwitch in conjunction with OpenStack and was excited to see that it worked - first time - after I installed and configured it.

To review, there are two compute nodes. These are NOT virtual machines; they are KVM / libvirt hosts (each KVM host is a compute node). And there is one OpenStack Controller that IS a virtual machine, and this virtual machine also serves as the OpenStack Network node (it is common to see the Controller and Network node separated, but I run them on the same VM).

The two OpenStack Compute Nodes appear to be working perfectly. Each of these has two bridges:
1. br-tun - used for tunneling to the controller
2. br-int - used as an integration hub (virtual patch cords between bridges); a quick way to check both is shown just below
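
To verify the bridges on a compute node, you can ask OpenVSwitch directly (assuming the ovs-vsctl utility is installed, which it will be if the openvswitch service is running):

    # list the bridges the OVS agent created on this compute node
    ovs-vsctl list-br

    # show each bridge with its ports (br-int should have a patch port over to br-tun)
    ovs-vsctl show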

The Controller node, however, is a bit different in that it needs to have an additional bridge called br-provider. This bridge connects out to the internet.
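
For reference, this is roughly how the provider bridge gets created on the Controller / Network node; the OVS agent builds br-int and br-tun itself, but the provider bridge and its uplink you add by hand. The interface name eth0 here is just a placeholder for whatever NIC actually faces the outside world:

    # create the provider bridge and attach the external-facing NIC to it
    ovs-vsctl add-br br-provider
    ovs-vsctl add-port br-provider eth0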

I am taking this architecture STRAIGHT out of the OpenStack documentation (I am running Newton, but the same diagrams also appear in the Ocata release of the documentation), which is at this website: https://docs.openstack.org/newton/networking-guide/deploy-ovs-selfservice.html

Below is a diagram - their diagram - shown here for convenience - to illustrate what I am trying to achieve:


Now, the OpenStack Install Guide for Newton doesn't have you creating this "Network Node" at all. Everything that would run on a Network Node runs on the Controller Node, so the Network Node and Controller Node are combined. While I am considering breaking the Network Node out into a separate VM or onto a separate small box, currently my Controller and Network Node live on the same CentOS 7 virtual machine, running under libvirt on one of the Compute Node hosts (outside of OpenStack itself).

Having it this way can be difficult in some ways. For instance, I have to run an OpenVSwitch agent on this Network Node, and this is not well documented (the documentation leads you to believe you only need the linuxbridge agent or openvswitch agent on the Compute Nodes). Another issue is that there are ports (tap) for every DHCP agent and every instance, which gets confusing when you look at your OpenVSwitch and see all of these cryptic names beginning with qr-, qg-, tap, and so on. You don't always know which is which or where to start debugging.
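
For anyone else combining the roles: this is roughly what the relevant part of openvswitch_agent.ini (under /etc/neutron/plugins/ml2/ on CentOS) looks like on my combined Controller / Network node. The IP address is a placeholder for this node's own tunnel-network address, and the "provider" mapping name is whatever your provider network uses:

    [ovs]
    bridge_mappings = provider:br-provider
    # this node's IP on the tunnel (overlay) network - placeholder value
    local_ip = 10.0.0.11

    [agent]
    tunnel_types = vxlan

    [securitygroup]
    firewall_driver = iptables_hybrid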

Anyway... I had a VERY difficult time getting OpenVSwitch to work. The first issue I had was that the bridges in OpenVSwitch did not have the right interfaces attached; that turned out to be a configuration issue. Another issue had to do with the br-provider bridge. I actually thought I should put provider bridges on the Compute Nodes too, even though the diagram did not show them there. I didn't understand why the Compute Nodes could not go straight out to the external network, and why their traffic needed to be tunneled over to the Controller / Network node, which seemed like a huge, unnecessary performance hit.
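
Once the configuration was straightened out, the OVS agent wires the bridges together itself with patch ports. A useful sanity check is to look at the ports on br-int; in my setup the agent-created patch ports look roughly like this (exact names can vary with the release and your bridge names):

    # br-int should have a patch port to br-tun, and (on the network node) one to br-provider
    ovs-vsctl list-ports br-int
    # typical output:
    #   int-br-provider
    #   patch-tun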

But the MAIN issue I had was that once everything was configured and I instantiated virtual machine instances, they would come up, but without an IP address. I set an instance's IP address statically, and lo and behold, it could communicate just fine. This told me that the issue was DHCP; the VMs were not getting an IP address via DHCP. At least I knew, but this concerned me because it did not seem like it would be easy to debug or fix (it wasn't).
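
The static-IP test is easy to reproduce from the instance console; the addresses here are placeholders for whatever your tenant subnet actually uses:

    # inside the instance (example addresses - substitute your subnet and gateway)
    ip addr add 192.168.100.50/24 dev eth0
    ip route add default via 192.168.100.1
    ping -c 3 192.168.100.1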

I checked the DHCP agent services. They were running. I checked the logs. They looked okay. I also did the same for the L3 agent.
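
Checking the agents is straightforward with the Newton-era neutron client; these are the sorts of commands I ran:

    # confirm the DHCP, L3 and OVS agents are registered and alive (look for the :-) column)
    neutron agent-list

    # on the controller / network node itself
    systemctl status neutron-dhcp-agent neutron-l3-agent neutron-openvswitch-agent
    tail -f /var/log/neutron/dhcp-agent.log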

I then mentioned this to my boss, and he suggested that OpenVSwitch was not handling the broadcast messages. He suggested I trace the broadcast traffic by checking the endpoints and working inward from there.

This led to me having to go into the network namespaces with "ip netns exec". What I found was that the interfaces inside the namespaces were shown with "@" suffixes in their names (the actual device name is the part before the "@"), which made it difficult to figure out what to hand to tcpdump.
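
A sketch of the kind of poking around this involves; the namespace and interface names below are placeholders, the real ones come from the "ip netns list" and "ip addr" output:

    # list the DHCP and router namespaces neutron created
    ip netns list

    # look at the interfaces inside a DHCP namespace (shown as e.g. tapXXXXXXXX-XX@if12)
    ip netns exec qdhcp-<network-uuid> ip addr

    # tcpdump wants the part before the "@" as the interface name
    ip netns exec qdhcp-<network-uuid> tcpdump -ni tapXXXXXXXX-XX port 67 or port 68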

I finally ran across the ovs-testing package, which has some utilities such as ovs-tcpdump that help debug bridges and interfaces inside the switch. When I did this, I found that the broadcast traffic WAS reaching the OpenStack Controller.
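
ovs-tcpdump mirrors an OVS port into a temporary dump interface so you can watch traffic that never touches a regular kernel device. The port name here is a placeholder; use one of the names from "ovs-vsctl list-ports":

    # watch DHCP traffic on an OVS port (extra arguments are passed through to tcpdump)
    ovs-tcpdump -i tapXXXXXXXX-XX -nn port 67 or port 68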

Finally - I realized that I needed to DELETE the DHCP agents and RECREATE them. The original agents had been created when the networks were created, back when I was using the linuxbridge agent. Apparently switching to OpenVSwitch rendered those agents inoperable; they could not "adapt" to the change. So I deleted the DHCP agents, OpenStack recreated them, and now things are working just fine.
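
For completeness, this is roughly how that delete-and-recreate goes with the neutron client; the UUID is a placeholder for the stale agent's ID from the listing:

    # find the stale DHCP agent left over from the linuxbridge days
    neutron agent-list | grep "DHCP agent"

    # delete it; the running neutron-dhcp-agent re-registers itself on its next status report
    neutron agent-delete <agent-uuid>
    systemctl restart neutron-dhcp-agent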

