We had a deployment where two customer VMs were set up as an Active-Standby cluster, and the failover wasn't working when they tested it.
I had already deployed a fully working pair of Active-Standby Virtual Machines using Keepalived, so I knew that VRRP worked. Now, I am not sure that the customer is using VRRP per se, but the concept of Active-Standby failover is the same whether or not either of us was using strict RFC-compliant VRRP.
So what was the difference between these customer VMs and our VMs?
Well, the difference was that I was running my VMs on VLAN-backed network segments that were jacked into (legacy) vCenter / ESXi Distributed Port Groups. The customer's VMs were jacked into NSX-T virtual switches (overlay segments).
So after re-verifying my VRRP failover (which worked flawlessly in both multicast and unicast peering configurations), the problem seemed to trace back to NSX-T.
Was it MAC spoofing? Was it a firewall? NSX-T does run an overlay firewall! And these firewalls sit at the segment level, but also at the Tier 1 router level. Sure enough, we realized that the Tier 1 firewall was dropping packets on failover attempts.
After much testing, we concluded that the drops were related to TOFU (trust on first use) on the IP Discovery Switching Profile.
This VMware link gives some insight:
Understanding IP Discovery Switching Profile
By default, the discovery methods ARP snooping and ND snooping operate in a mode called trust on first use (TOFU). In TOFU mode, when an address is discovered and added to the realized bindings list, that binding remains in the realized list forever. TOFU applies to the first 'n' unique <IP, MAC, VLAN> bindings discovered using ARP/ND snooping, where 'n' is the binding limit that you can configure. You can disable TOFU for ARP/ND snooping. The methods will then operate in trust on every use (TOEU) mode. In TOEU mode, when an address is discovered, it is added to the realized bindings list and when it is deleted or expired, it is removed from the realized bindings list. DHCP snooping and VM Tools always operate in TOEU mode.
So guess what? After disabling this profile, and effectively disabling TOFU mode, TOEU mode kicked in and lo and behold, the customer's failover started working.
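To make the TOFU / TOEU difference concrete for myself, here is a toy model of a realized bindings list in Python. This is only a sketch of the behaviour described in the quoted text, not NSX-T's actual implementation. The class name, the binding limit of 1, the addresses (documentation ranges) and the idea that a full list simply ignores new bindings are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PortBindings:
    """Toy model of the realized bindings list on one port (hypothetical)."""
    limit: int   # binding limit 'n' (assumed value, set by the caller)
    tofu: bool   # True = trust on first use, False = trust on every use
    realized: list = field(default_factory=list)

    def discover(self, ip, mac):
        """ARP snooping saw <ip, mac>; return True if it is (now) realized."""
        binding = (ip, mac)
        if binding in self.realized:
            return True
        if len(self.realized) >= self.limit:
            return False  # assumption: a full list ignores new bindings
        self.realized.append(binding)
        return True

    def expire(self, ip, mac):
        """The binding was deleted or aged out."""
        # TOFU keeps realized bindings forever; TOEU removes them.
        if not self.tofu and (ip, mac) in self.realized:
            self.realized.remove((ip, mac))


def simulate(tofu):
    mac = "00:00:5e:00:53:01"           # hypothetical MAC of the VM's vNIC
    port = PortBindings(limit=1, tofu=tofu)
    port.discover("192.0.2.50", mac)    # first address ever seen on the port
    port.expire("192.0.2.50", mac)      # that address later goes away
    accepted = port.discover("192.0.2.100", mac)  # VIP shows up on failover
    return accepted, port.realized


print("TOFU:", simulate(tofu=True))
print("TOEU:", simulate(tofu=False))
```

In this model, the TOFU port still holds the first binding and never realizes the virtual IP, while the TOEU port drops the expired binding and picks up the virtual IP. If the firewall relies on realized bindings, that would be consistent with the drops we saw, though I have not confirmed that this is exactly what NSX-T does internally.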