Friday, January 19, 2018

OpenStack Networking - More Learnings

For a while now, OpenStack has been working just fine for the most part, except for some issues I hit when switching back and forth between the OpenVSwitch and LinuxBridge agents (earlier posts discuss some of these issues).

This week, I ran into some issues trying to use OpenStack with virtual machines that use MULTIPLE interfaces, as opposed to just a single eth0 interface.

This post will discuss some of those issues.

The first issue happened when I added a second and third router to my existing network.

The diagram below depicts the architecture I had designed. The router on the top and bottom were the new routers, and the orange, red and brown networks were newly added networks.
[Diagram: Adding second and third routers to an OpenStack network]

What I noticed IMMEDIATELY was that DHCP seemed to stop working on any instantiated networks. Stranger still, the IP assignment behavior became erratic. The VMs might or might not get an IP at all, and when they did, it might be on their own or even each other's network segments. But the IP was NEVER in the DHCP-defined range.

I always use an allocation range of .11 to .199 for all of my /24 networks, just so I can tell at a glance whether an IP was handed out correctly. This paid off, because even in situations where the LAN segment was correct, I found that VMs were getting a .3 address.
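For context, that allocation range is just part of the subnet definition. It looks something like this with the openstack client - the names and addresses here are examples only:

    openstack subnet create --network demo-net \
        --subnet-range 10.10.10.0/24 \
        --allocation-pool start=10.10.10.11,end=10.10.10.199 \
        demo-subnet
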

So I knew DHCP was not working. I finally realized that the ".3" address the VMs were getting was an address OpenStack hands an instance when DHCP is not working - sort of like a default.

So why was DHCP not working?

The answer - after considerable debugging - turned out to be the "agent" function in OpenStack.

If you are running the LinuxBridge agent, you should see something that looks like this when you list the Neutron agents ("openstack network agent list" with the newer client, or "neutron agent-list" with the legacy one):

[Screenshot: Proper listing of Alive and State values on a Neutron agent listing in OpenStack]

Keep in mind that there are two fields here to consider. One is the "Alive" field, and the other is the "State" field (the agent's admin state).

Somehow, because I had at one time been running OpenVSwitch on this controller, I had entries in this list for OpenVSwitch agents. And I found that while the Alive value for those agents was DOWN, the State field was set to UP!

Meanwhile, the LinuxBridge agents were Alive, but their State field was set to DOWN!

These were reversed! No wonder nothing was working. I figured this out after tracing ping traffic through the tap interfaces, only to find that the pings were disappearing at the router.
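For anyone who wants to repeat that tracing exercise, the technique is roughly this - the tap interface and router namespace names are placeholders, and will differ on your hosts:

    # On the compute node: watch ICMP and DHCP traffic on the VM's tap interface
    tcpdump -n -i tapXXXXXXXX-XX 'icmp or port 67 or port 68'

    # On the network node: check whether the traffic ever reaches the router namespace
    ip netns exec qrouter-<router-uuid> tcpdump -n -i any icmp
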

I wound up doing two things to fix this (rough command syntax is sketched below):
a. I deleted the OpenVSwitch entries altogether using the agent delete command.
b. I changed the admin state to "UP" for the LinuxBridge agents using the agent set command - which results in what we see above as the final correct result.
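For reference, this is roughly what those two steps look like with the openstack client - the agent UUIDs are placeholders that come out of the agent listing, and exact syntax can vary by client version:

    # Remove the stale OpenVSwitch agent records entirely
    openstack network agent delete <ovs-agent-uuid>

    # Flip the admin state of the LinuxBridge agents to UP
    openstack network agent set --enable <linuxbridge-agent-uuid>
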

All of a sudden, things started working. VMs got IP addresses - and the CORRECT DHCP addresses - because the DHCP requests were getting all the way through and the responses were coming back.

Now....the second issue.

The internal network was created as a LOCAL network. It turns out that DHCP does not seem to work on networks of this type in OpenStack. If you do define DHCP on the LOCAL network, OpenStack will reserve an IP for it. But the VM does not seem to actually get the IP, because the DHCP requests and offers don't seem to flow.
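For clarity, by a LOCAL network I mean one created along these lines - names and addresses are examples, using the same .11-.199 pool habit as above:

    # A network with no provider connectivity; its traffic never leaves the host
    openstack network create --provider-network-type local --share internal-net

    # DHCP is enabled on the subnet, even though it does not appear to function here
    openstack subnet create --network internal-net \
        --subnet-range 192.168.100.0/24 \
        --allocation-pool start=192.168.100.11,end=192.168.100.199 \
        --dhcp internal-subnet
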

Keep in mind (see the diagram above) that the local network is shared, but it is NOT directly connected to a router. This COULD be one reason why it is not getting a DHCP address. So - you can just set one manually, right? WRONG!

I set up a couple of ports on this network with FIXED IPs. And if the VMs are provisioned using these FIXED IPs, you cannot just arbitrarily add some other IP on the same segment to that interface and expect traffic to work. It won't. Neutron's port security (anti-spoofing) rules only pass traffic that matches the IP/MAC pair assigned to the port, so the IP needs to be the one OpenStack expects it to be. Period.
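Here is a sketch of what I mean by provisioning against a fixed-IP port - again, all names and addresses are examples, not from my actual environment:

    # Reserve a specific IP on the internal network as a port
    openstack port create --network internal-net \
        --fixed-ip subnet=internal-subnet,ip-address=192.168.100.50 vm1-internal-port

    # Boot the VM against that port (rather than against the network)
    openstack server create --image cirros --flavor m1.small \
        --nic port-id=<port-uuid> vm1
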

It is the same story with DHCP. Even though DHCP does not seem to work on the LOCAL isolated network, if you define a DHCP range on that network, OpenStack will still reserve an address for the VM - but the VM may never receive it. You can assign the address manually (e.g. using the iproute2 tools), but if you do not choose the same one that OpenStack has reserved, traffic will not work. So your VM somehow needs to KNOW what address has been reserved for it - and that's not easy to do from inside the guest.
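If you do end up setting the address inside the guest, it has to be exactly the one Neutron reserved - something like this with iproute2, where the address is just an example:

    # Inside the VM: manually assign the IP that Neutron reserved for this port
    ip addr add 192.168.100.50/24 dev eth1
    ip link set eth1 up
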

Therefore, the solution seems to be to use a Port. That way the instantiator chooses the IP, and can set it on the VM if it does not get set through OpenStack (which I believe uses cloud-init to set that IP - though I should double-check this to be sure).
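From outside the VM, at least, the reserved address can always be looked up on the port itself - continuing with the example names from above:

    # Show the fixed IP that Neutron assigned to the port
    openstack port show vm1-internal-port -c fixed_ips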
