Friday, January 13, 2023

Debugging Dropped Packets on NSX-T E-NVDS

 

Inside the hypervisor, the servers have the following physical nics:

  • vmnic0 – 1G nic, Intel X550 – Unused
  • vmnic1 – 1G nic, Intel X550 - Unused
  • vmnic2 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
  • vmnic3 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
  • vmnic4 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up
  • vmnic5 - 10G nic, SFP+, Intel XL710, driver version 2.1.5.0 FW version 8.50, link state up

The nics connect physically to the upstream switches (Aristas), and on the host side they serve as uplinks for the virtual switches (discussed right below):

 

Inside Hypervisor (Host 5 in this specific case):

Distributed vSwitch

                Physical Nic Side: vmnic2 and vmnic4

                Virtual Side: vmk0 (VLAN 3850) and vmk1 (VLAN 3853)

 

NSX-T Switch (E-NVDS)

                Physical NIC side: vmnic3 and vmnic5 -> these are the nics that get hit when we run the load tests

                Virtual Side: 50+ individual segments that VMs connect to, with each VM getting assigned a port

 

Now, in my previous email, I dumped the stats for the physical NIC – meaning, from the “NIC itself” as seen by the ESXi operating system.

 

But it is also wise to take a look at the stats of the physical nic from the perspective of the virtual switch! Remember, vmnic5 is a port on the virtual switch!

 

So first, we need to figure out what port we need to look at:
net-stats -l

PortNum          Type SubType SwitchName       MACAddress         ClientName

2214592527          4       0 DvsPortset-0     40:a6:b7:51:56:e9  vmnic3

2214592529          4       0 DvsPortset-0     40:a6:b7:51:1b:9d  vmnic5 -> here we go, port 2214592529 on switch DvsPortset-0 is the port of interest

67108885            3       0 DvsPortset-0     00:50:56:65:96:e4  vmk10

67108886            3       0 DvsPortset-0     00:50:56:65:80:84  vmk11

67108887            3       0 DvsPortset-0     00:50:56:66:58:98  vmk50

67108888            0       0 DvsPortset-0     02:50:56:56:44:52  vdr-vdrPort

67108889            5       9 DvsPortset-0     00:50:56:8a:09:15  DEV-ISC1-Vanilla3a.eth0

67108890            5       9 DvsPortset-0     00:50:56:8a:aa:3f  DEV-ISC1-Vanilla3a.eth1

67108891            5       9 DvsPortset-0     00:50:56:8a:9d:b1  DEV-ISC1-Vanilla3a.eth2

67108892            5       9 DvsPortset-0     00:50:56:8a:d9:65  DEV-ISC1-Vanilla3a.eth3

67108893            5       9 DvsPortset-0     00:50:56:8a:fc:75  DEV-ISC1-Vanilla3b.eth0

67108894            5       9 DvsPortset-0     00:50:56:8a:7d:cd  DEV-ISC1-Vanilla3b.eth1

67108895            5       9 DvsPortset-0     00:50:56:8a:d4:d8  DEV-ISC1-Vanilla3b.eth2

67108896            5       9 DvsPortset-0     00:50:56:8a:67:6f  DEV-ISC1-Vanilla3b.eth3

67108901            5       9 DvsPortset-0     00:50:56:8a:32:1c  DEV-MSC1-Vanilla3b.eth0

67108902            5       9 DvsPortset-0     00:50:56:8a:e6:2b  DEV-MSC1-Vanilla3b.eth1

67108903            5       9 DvsPortset-0     00:50:56:8a:cc:eb  DEV-MSC1-Vanilla3b.eth2

67108904            5       9 DvsPortset-0     00:50:56:8a:7a:83  DEV-MSC1-Vanilla3b.eth3

67108905            5       9 DvsPortset-0     00:50:56:8a:63:55  DEV-MSC1-Vanilla3a.eth3

67108906            5       9 DvsPortset-0     00:50:56:8a:40:9c  DEV-MSC1-Vanilla3a.eth2

67108907            5       9 DvsPortset-0     00:50:56:8a:57:8f  DEV-MSC1-Vanilla3a.eth1

67108908            5       9 DvsPortset-0     00:50:56:8a:5b:6d  DEV-MSC1-Vanilla3a.eth0
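(By the way, the stats below were pulled by cd'ing down to the port node inside an interactive vsish shell. If you just want that one node, a non-interactive form like this should work too:)

# vsish -e get /net/portsets/DvsPortset-0/ports/2214592529/stats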

 

/net/portsets/DvsPortset-0/ports/2214592529/> cat stats

packet stats {

   pktsTx:10109633317

   pktsTxMulticast:291909

   pktsTxBroadcast:244088

   pktsRx:10547989949 -> total packets RECEIVED on vmnic5’s port on the virtual switch

   pktsRxMulticast:243731083

   pktsRxBroadcast:141910804

   droppedTx:228

   droppedRx:439933 -> This is a lot more than the 3,717 Rx Missed errors, and probably accounts for why MetaSwitch sees more drops than we saw up to this point!

}
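To put that droppedRx number in perspective: 439,933 drops out of 10,547,989,949 packets received works out to roughly 0.004% of the traffic on that uplink port. Tiny as a percentage, but it is over a hundred times the 3,717 Rx Missed errors we saw on the NIC itself, and for RTP media streams even small loss can show up in the test results.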

 

So – we have TWO things now to examine here.

  • Is the Receive Side Scaling configured properly and working? 
    • We configured it, but…we need to make sure it is working and working properly.
    • We don’t see all of the queues getting packets. Each Rx Queue should be getting its own CPU. (A couple of commands that may help verify this are sketched right after this list.)
  • Once packets get into the Ring Buffer and are passed up to the VM (the poll mode driver picks the packets up off the Ring), they hit the virtual switch.
    • And the switch is dropping some packets.
    • Virtual switches are software. As such, they need to be tuned to stretch their capability to keep up with what legacy hardware switches can do.
      • The NSX-T switch is a powerful switch, but is also a newer virtual switch, more bleeding edge in terms of technology. 
      • I wonder if we are running the latest greatest version of this switch, and if that could help us here.
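To the first point above (whether RSS is really spreading receives across queues and CPUs), a couple of esxcli commands may help confirm it. This is a sketch; the exact sub-commands and output fields vary a bit by ESXi build and i40en driver version:

# esxcli network nic queue count get                 <-- how many Rx/Tx queues each vmnic currently has
# esxcli network nic stats get -n vmnic5             <-- per-NIC counters, including receive missed errors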

 

Now, I looked even deeper into the E-NVDS switch. I went into vsish shell, and started examining any and all statistics that are captured by that networking stack.

 

Since we are concerned with receives, I looked at the InputStats specifically. I noticed there are several filters – which, I presume, are tied to a VMWare packet filtering flow, analogous to Netfilter in Linux, or perhaps the Berkeley Packet Filter. But I have no documentation whatsoever on this, and can’t find any, so I did my best to “back into” what I was seeing.

 

I see the following filters that packets can traverse (traceflow is presumably tied to packet capture / Traceflow, but I am not sure beyond that). A quick way to list them yourself is sketched right after this list:

  • ens-slowpath-input
  • traceflow-Uplink-Input:0x43110ae01630
  • vdl2-uplink-in:0x431e78801dd0
  • UplinkDoSwLRO@vmkernel#nover
  • VdrUplinkInput
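For reference, this is roughly how the filters can be enumerated directly, using the port number we found with net-stats -l above:

# vsish -e ls /net/portsets/DvsPortset-0/ports/2214592529/inputFilters/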

 

If we go down into the filters and print the stats out, most of the stats seem to line up (started=passed, etc) except vdl2-uplink-in, which has drops in it on both uplink ports:

/net/portsets/DvsPortset-0/ports/2214592529/inputFilters/vdl2-uplink-in/> cat stats
packet stats {
   pktsIn:31879020
   pktsOut:24269629
   pktsDropped:7609391
}

/net/portsets/DvsPortset-0/ports/2214592527/inputFilters/vdl2-uplink-in/> cat stats
packet stats {
   pktsIn:24817038
   pktsOut:17952829
   pktsDropped:6864209
}
 

That seems like a lot of dropped packets to me (roughly 24% and 28% of the packets entering the vdl2-uplink-in filter on those two ports, and a LOT more than those Rx Missed errors), so this looks like something we need to work with VMWare on, because if I understand these stats properly, it suggests an issue on the virtual switch more than on the adaptor itself.

 

Another thing I saw while poking around was this interesting-looking WRONG_VNIC passthrough status on vmnic3 and vmnic5, the two nics being used in the test here. I think we should ask VMWare about this and run it down also.

 

/net/portsets/DvsPortset-0/ports/2214592527/> cat status
port {
   port index:15
   vnic index:0xffffffff
   portCfg:
   dvPortId:4dfdff37-e435-4ba4-bbff-56f36bcc0779
   clientName:vmnic3
   clientType: 4 -> Physical NIC
   clientSubType: 0 -> NONE
   world leader:0
   flags: 0x460a3 -> IN_USE ENABLED UPLINK DVS_PORT DISPATCH_STATS_IN DISPATCH_STATS_OUT DISPATCH_STATS CONNECTED
   Impl customized blocked flags:0x00000000
   Passthru status: 0x1 -> WRONG_VNIC
   fixed Hw Id:40:a6:b7:51:56:e9:
   ethFRP:frame routing {
      requested:filter {
         flags:0x00000000
         unicastAddr:00:00:00:00:00:00:
         numMulticastAddresses:0
         multicastAddresses:
         LADRF:[0]: 0x0
         [1]: 0x0
      }
      accepted:filter {
         flags:0x00000000
         unicastAddr:00:00:00:00:00:00:
         numMulticastAddresses:0
         multicastAddresses:
         LADRF:[0]: 0x0
         [1]: 0x0
      }
   }
   filter supported features: 0 -> NONE
   filter properties: 0 -> NONE
   rx mode: 0 -> INLINE
   tune mode: 2 -> invalid
   fastpath switch ID:0x00000000
   fastpath port ID:0x00000004
}

Tuesday, January 10, 2023

VMWare NSX-T Testing - Dropped Packets

We have been doing some performance testing with a voice system.

In almost all cases, these tests are failing. They are failing for two reasons:

  1. Rx Missed counters on the physical adaptors of the hypervisors that are used to send the test traffic. These adaptors are connected to the E-NVDS virtual switch on one side, and to an upstream Arista data center switch on the other. (A sketch of how we read these counters follows this list.)
  2. Dropped Packets - mostly media (RTP UDP), with less than 2% of the drops being RTCP traffic (TCP).
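For reference, the Rx Missed counters were read per physical adaptor on the ESXi host, roughly like this (the exact counter names can vary by driver and build):

# esxcli network nic stats get -n vmnic5              <-- look for "Receive missed errors"
# vsish -e get /net/pNics/vmnic5/stats                <-- a similar view via vsish (node layout may differ by driver)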

Lately, I used the "Performance Best Practices for VMWare vSphere 7.0" guide as a method for trying to reduce the dropped packets we were seeing.

We attempted several things that were mentioned in this document:

 

  • ESXi NIC - enable Receive Side Scaling (RSS)
    • Actually, to be technical, we enabled DRSS (Default Queue RSS) rather than the RSS (NetQ RSS) which the i40en driver also supported for this Intel X710 adaptor.
  • LatencySensitivity=High - and we checked "Reserve all Memory" on the checkbox
  • Interrupt Coalescing
    • Disabling it, to see what effect disabling it had
    • Setting it from its rate-based scheme (the default, rbc) to static with 64 packets per interrupt

We didn't really see any noticeable improvement from the Receive Side Scaling or the Latency Sensitivity settings, which was a surprise, actually. We did see some, perhaps minor, improvement from interrupt coalescing when we set it to static.
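For the record, here is roughly how those knobs get applied. Treat this as a sketch rather than exact syntax: the RSS/DRSS module parameter names and value formats depend on the driver version (check them with the list command first), and the coalescing and latency settings are per-VM advanced parameters:

# esxcli system module parameters list -m i40en            <-- see which RSS/DRSS parameters this driver exposes
# esxcli system module parameters set -m i40en -p DRSS=4   <-- example only; reload the driver or reboot afterwards

Per-VM advanced settings (VM powered off, Edit Settings -> VM Options -> Advanced):

ethernet0.coalescingScheme = "static"                 <-- default is rbc; can also be set to disabled
ethernet0.coalescingParams = "64"                     <-- packets per interrupt when using static
sched.cpu.latencySensitivity = "high"                 <-- requires full CPU and memory reservations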

Thursday, November 10, 2022

Zabbix log file parsing - IPtables packet drops

I recently had to create some iptables rules for a clustered system, restricting traffic from RabbitMQ, Percona and other services to just the nodes participating in the cluster.  The details of this could be a blog post in and of itself.

I set up logging on my IPTables drops, which was initially for debugging. I had planned on removing those logs. But, once I got all the rules working perfectly, the amount of logging drops down to almost zero. The only thing that gets logged at that point, are things you would want to know about. So I decided to leave the logging for dropped packets in play (if you do this, make sure your logs wrap and manage your disk space!).
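For anyone curious, the logging rules look roughly like the sketch below. The log prefixes are what the Zabbix items later in this post key off of; I place these at the very end of the INPUT and OUTPUT chains, after all of the ACCEPT rules, so only traffic that is about to be dropped gets logged:

# iptables -A INPUT  -j LOG --log-prefix "IPtables Input Dropped: " --log-level 4
# iptables -A INPUT  -j DROP
# iptables -A OUTPUT -j LOG --log-prefix "IPtables Output Dropped: " --log-level 4
# iptables -A OUTPUT -j DROP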

When you put up a firewall that governs not only inbound traffic but also outbound traffic, you learn a LOT about what your system is receiving and sending, and it is always interesting to see traffic being dropped. I had to continually investigate traffic that was being dropped (especially outbound), and whitelist those services once I discovered what they were. NTP, DNS, all the normal things, but also some services that don't come to mind easily.

Logging onto systems to see what traffic is being dropped is a pain. You might be interested and do it initially, but eventually you will get tired of having to follow an operational routine.

I decided to do something more proactive. Send dropped packets to Zabbix.

To do this, here are the steps:

1. Add 2 new items to the hosts that employed the IPTables rules.

  • Iptables Dropped Packet Inbound
  • Iptables Dropped Packet Outbound

2.  Configure your items 

Configure the items properly, as shown below (the example is the Inbound drops item)

Zabbix Log File Item


First, and this is important, note the Type. It is Zabbix agent (active), not the default "Zabbix agent". This is because a log file check requires an active agent, and will not work if Zabbix is configured as a passive agent. There are plenty of documents and blogs on the web that discuss the difference between Active and Passive.

A passive agent configuration means Zabbix will open the TCP connection and fetch back the measurements it wants on port 10050. An active agent, means that the monitored node will take the responsibility of initiating the connection to the Zabbix server, on port 10051. So passive is "over and back" (server to target and back to server), while active is "over" (target to server). There are firewall repercussions for sure, as the traffic flow is different source to destination, and the two approaches use different ports.

Note that the 2nd parameter on the key, is the search string in the log. This can be a regular expression if you want to grab only a portion of the full line. If you use a regular expression, you can put it in tickmarks. I imagine a lot of people don't get this 2nd parameter correct which leads to troubleshooting. And, I think many people don't realize that you don't need to be putting in carets and such, to get what you need. For instance, I initially had "^.*IPtables Input Dropped.*$" in this field, when I realized later I only needed "IPtables Input Dropped". You only need to be slicing and dicing with complex regular expressions, if you want the output to be trimmed up.
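As a concrete example, the key for the inbound item ends up looking something like this (the log file path is an assumption here; point it at wherever your kernel/iptables messages actually land, e.g. /var/log/messages or /var/log/kern.log):

log[/var/log/messages,IPtables Input Dropped]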

The third parameter to be concerned with, is the Type of Information. This is often overlooked. It needs to be Log. Overlook this, which is easy to do, and it won't work!

I went with an update interval of 1m (1 minute), and a history storage period of 30d (thirty days). This was to be conservative on space lest we get a ton of log activity.

Also note that the item is Disabled (Enabled unchecked). This is because - we can not actually turn this on yet! We need to configure the agent for Active! THEN, this item can be verified and tested.

3. Next, you need to configure your agent for Active.

I wasn't sure initially if a Zabbix agent had to be one or the other (Active or Passive, mutually exclusive). I had my agent initially set to Passive. In Passive Mode, Zabbix reaches out on an interval and gathers up measurements. Passive is the default when you install a Zabbix agent.

To configure your agent to be Active, the only thing I needed to do was put the Zabbix server's address in the active checks section (the ServerActive=x.x.x.x parameter) of the zabbix_agentd.conf file in the /etc/zabbix directory of the monitored VM.
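A minimal sketch of the relevant lines in /etc/zabbix/zabbix_agentd.conf (addresses and hostname are placeholders):

Server=x.x.x.x             <-- passive checks: which server may poll this agent (port 10050)
ServerActive=x.x.x.x       <-- active checks: where this agent pushes data (port 10051)
Hostname=my-monitored-vm   <-- must match the host name configured in the Zabbix frontend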

Don't forget to restart the zabbix-agent service!

4. Next, you need to make sure the firewalls on both sides are allowing Zabbix

On the monitored node (running Zabbix Agent), I was running iptables - not FirewallD. So I had to add an iptables rule for port 10051. 

iptables -A OUTPUT -p tcp -d ${ZABBIX} -m tcp --dport 10051 -j ACCEPT

On the Zabbix server itself, which happens to be running FirewallD, we simply added port 10051 to the /etc/firewalld/services/zabbix.xml file:

 <?xml version="1.0" encoding="utf-8"?>
<service>
  <short>Zabbix</short>
  <description>Allow services for Zabbix server and agent</description>
  <port protocol="tcp" port="10050"/>
  <port protocol="tcp" port="10051"/>
</service>

But you are not done when you add this! Make sure you restart the firewall, which can be done by restarting the service, or, if you don't want a gap in coverage, you can "reload" the rules on a running firewall with:

# firewall-cmd --reload

which generates "success" if the firewall restarts properly with the newly added modifications.

5. Now it is time to go back and enable your items!

Go back to the Zabbix GUI, select Configuration-->Hosts, and choose the host(s) that you added your 2 new items for (IPtables-Input-Dropped, IPtables-Output-Dropped). Select Items on each one of these hosts, choose the item and click the Enabled checkbox.

6. Check for Data

Make sure you wait a bit, because the interval time we set is 1 minute. 

After a reasonable length of time (suggested 5 minutes), go into the Zabbix GUI and select Monitoring-->Latest Data. In the hosts field, put one of the hosts into the field to filter the items for that particular host.

Find the two items, which are probably at the last page (the end), in a section called "other", since manually added Items tend to be grouped in the "other" category. 

On the right hand side, you should see "History". When you click History, your log file entries show up in the GUI!

Friday, October 28, 2022

Moving a LVM file system to a new disk in Linux

I had to dive back into Linux disk partitioning, file systems, and volumes when I got an alert from Zabbix that a cluster of 3 VMs was running out of space. The alert said disk usage was greater than 88 percent, so I grew concerned and took a look.

In the labs, we had 3 x CentOS7 Virtual Machines, each deployed with a 200G VMDK file.  But inside the VM, in the Linux OS, there were logical volumes (centos-root, centos-swap, centos-home) that were mounted as XFS file systems on a 30G partition. There was no separate volume for /var (centos-var). And /var was the main culprit of the disk space usage. 

The decision was made to put /var on a separate disk as a good practice, because the var file system was used to store large virtual machine images.

The following steps were taken to move the /var file system to the new disk:

1. Add new Disk in vCenter to VM - create new VMDK file (100G in this particular case)

2. If the disk is seen, a /dev/sdb will be present in the Linux OS of the virtual machine. We need to create a partition on it (/dev/sdb1).
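If the new disk does not show up right away, you can usually get Linux to rescan the SCSI bus without a reboot. A common approach (the host number may differ on your system):

# echo "- - -" > /sys/class/scsi_host/host0/scan
# lsblk                      <-- /dev/sdb should now be visible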
 
# fdisk /dev/sdb

n is the option to create a new partition, then p to select primary, then a handful of questions that don't matter much in this case (partition number, first and last cylinder); just take the defaults.
This creates a Linux primary partition. You will then need to use the t command to change the partition type to 8e (Linux LVM).
Then w will write everything to the disk and exit from fdisk.
# fdisk -l /dev/sdb

Will return something like this:

Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 20971519 20969472 10G 8e Linux LVM

3. Add the device as a physical volume (this initializes /dev/sdb1 for use by LVM; the partition itself was created in step 2)
# pvcreate /dev/sdb1

NOTE: to delete a device from a physical volume, use vgreduce first, then pvremove!
vgreduce centos /dev/sdb1
pvremove /dev/sdb1

4. display volume group
# vgdisplay

--- Volume group ---
VG Name centos
[... more detail …]

5. display physical volumes in volume group
 
# pvdisplay -C --separator '  |  ' -o pv_name,vg_name

6. Extend the volume group so it can contain the new disk (partition)

# vgextend centos /dev/sdb1

You will get info like this:
VG Size 29.75 GiB
PE Size 4.00 MiB
Total PE 7617
Alloc PE / Size 5058/ 19.75 GiB
Free PE / Size 2559 / 10 GiB

7. Create new logical volume

NOTE: this command can be tricky. You either need to know extents and the syntax, or you can keep it simple, such as:
# lvcreate -n var -l 100%FREE centos

8. Create file system - NOTE that XFS is the preferred type, not ext4!
# mkfs -t xfs /dev/centos/var

9. Mount the new target var directory as newvar
# mkdir /mnt/newvar
# mount /dev/centos/var /mnt/newvar

10. Copy the files

NOTE: Lots of issues can occur during this, depending on what directory you are copying (e.g. /var is notorious because of its run and lock dirs).

I found this command to work:
# cp -apxv /var/* /mnt/newvar

Another one people seem to like is the rsync command, but the one below hung when I attempted it:
# rsync -avHPSAX /var/ /mnt/newvar/

11. You can do a diff, or try to, to sanity-check how the copy went:
# diff -r /var /mnt/newvar/

12. Update fstab for reboot
/dev/mapper/centos-var /var xfs defaults 0 0

Note that we used the logical volume centos-var here, not centos (the volume group). LVM calls the volumes centos-swap, centos-home, etc.

13. Move the old /var on old root file system
# mv /var /varprev

14. Create a new, empty /var folder and mount it (this uses the fstab entry from step 12)
# mkdir /var
# mount /var

15. Use the df command to list all the mounts
# df -h | grep /dev/

16. Decide whether you want to remove the old var file system and reclaim that disk space.

NOTE: Do not do this until you’re damned sure the new one is working fine. I recommend rebooting the system, inspecting all services that need to be running, etc.  

Now, the only thing left to consider is that after we have moved /var to a new 100G VMDK disk, what do we do about the fact that we now have a 200G boot/swap/root disk that is only using a small fraction of its 200G in space? Well, shrinking disks is even MORE daunting, and is not the topic of this post. But, if I decide to reclaim some space, expect another post that documents how I tackled that effort (or attempted to). 

For now, no more alerts about running out of space on a root file system is good news, and this VM can now run peacefully for quite a while.

Tuesday, June 7, 2022

VMWare Network Debugging - Trex Load Generation and Ring Buffer Overflow

We began running Trex Traffic Generator testing, sending load to a couple of virtual machines running on ESXi vSphere-managed hypervisors, and are running into some major problems.

First, the Trex Traffic Generator:

  • Cent7 OS virtual machine
  • 3 ports
    • eth0 used for ssh connectivity and to run Trex and Trex Console (with screen utility)
    • eth1 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
    • eth2 for sending traffic (Trex will put this port into DPDK-mode so OS cannot see it)
  • 4 cores 
    • the VM actually has 6, but two are used for running OS and Trex Admin
    • Traffic Tests utilize 4 cores

Next, the Device(s) Under Test (DUT):

  1. Juniper vSRX, which is a router VM (based on JUNOS, which runs FreeBSD, a Berkeley Unix derivative, under the hood)
  2. Standard CentOS7 Virtual Machine
     

We ran the stateless imix test, at 20% and 100% line utilization.

We noticed that the Trex VM was using 80-90% core usage in the test (Trex Stats from console), and was using 20-25% line utilization, sending 4Gbps per port (8Gbps total) to the DUT virtual machines.

On the receiving side, the router was only processing about 1/4 to 1/6 of the packets sent by Trex.  The Cent7 VM, also, could not receive more than about 3.5Gbps maximum.

So what is happening? This led us to a Deep Dive, into the VMWare Statistics.

By logging into the ESXi host that the receiving VM was running on, we could first find out what virtual switch and port the VM interface was assigned to, by running:

# net-stats -l

This produces a list, like this:

PortNum          Type SubType SwitchName       MACAddress         ClientName
50331650            4       0 DvsPortset-0     40:a6:b7:51:18:60  vmnic4
50331652            4       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmnic2
50331654            3       0 DvsPortset-0     40:a6:b7:51:1e:fc  vmk0
50331655            3       0 DvsPortset-0     00:50:56:63:75:bd  vmk1
50331663            5       9 DvsPortset-0     00:50:56:8a:af:c1  P6NPNFVNDPKVMA.eth1
50331664            5       9 DvsPortset-0     00:50:56:8a:cc:74  P6NPNFVNDPKVMA.eth2
50331669            5       9 DvsPortset-0     00:50:56:8a:e3:df  P6NPNFVNRIV0009.eth0
67108866            4       0 DvsPortset-1     40:a6:b7:51:1e:fd  vmnic3
67108868            4       0 DvsPortset-1     40:a6:b7:51:18:61  vmnic5
67108870            3       0 DvsPortset-1     00:50:56:67:c5:b4  vmk10
67108871            3       0 DvsPortset-1     00:50:56:65:2d:92  vmk11
67108873            3       0 DvsPortset-1     00:50:56:6d:ce:0b  vmk50
67108884            5       9 DvsPortset-1     00:50:56:8a:80:3c  P6NPNFVNDPKVMA.eth0

A couple of nifty commands will show you the statistics:
# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/clientStats
port client stats {
   pktsTxOK:115
   bytesTxOK:5582
   droppedTx:0
   pktsTsoTxOK:0
   bytesTsoTxOK:0
   droppedTsoTx:0
   pktsSwTsoTx:0
   droppedSwTsoTx:0
   pktsZerocopyTxOK:0
   droppedTxExceedMTU:0
   pktsRxOK:6595337433
   bytesRxOK:2357816614826
   droppedRx:2934191332 <-- lots of dropped packets
   pktsSwTsoRx:0
   droppedSwTsoRx:0
   actions:0
   uplinkRxPkts:0
   clonedRxPkts:0
   pksBilled:0
   droppedRxDueToPageAbsent:0
   droppedTxDueToPageAbsent:0
}

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
   LRO pkts rx ok:0
   LRO bytes rx ok:0
   pkts rx ok:54707478
   bytes rx ok:19544123192
   unicast pkts rx ok:54707448
   unicast bytes rx ok:19544121392
   multicast pkts rx ok:0
   multicast bytes rx ok:0
   broadcast pkts rx ok:30
   broadcast bytes rx ok:1800
   running out of buffers:9325862
   pkts receive error:0
   1st ring size:4096 <-- this is a very large ring buffer size!
   2nd ring size:256
   # of times the 1st ring is full:9325862 <-- WHY packets are being dropped
   # of times the 2nd ring is full:0
   fail to map a rx buffer:0
   request to page in a buffer:0
   # of times rx queue is stopped:0
   failed when copying into the guest buffer:0
   # of pkts dropped due to large hdrs:0
   # of pkts dropped due to max number of SG limits:0
   pkts rx via data ring ok:0
   bytes rx via data ring ok:0
   Whether rx burst queuing is enabled:0
   current backend burst queue length:0
   maximum backend burst queue length so far:0
   aggregate number of times packets are requeued:0
   aggregate number of times packets are dropped by PktAgingList:0
   # of pkts dropped due to large inner (encap) hdrs:0
   number of times packets are dropped by burst queue:0
   number of packets delivered by burst queue:0
   number of packets dropped by packet steering:0
   number of packets dropped due to pkt length exceeds vNic mtu:0 <-- NOT the issue!
}

We were also able to notice that this VM had an Rx queue per vCPU added to the VM (no additional changes were made to this specific Cent7 VM's settings):

# vsish -e ls /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues
0/
1/
2/
3/
4/
5/
6/
7/

Each of the queues can be dumped individually, to check Ring Buffer size (we did this and they were all 4096):

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/1/status
status of a vmxnet3 vNIC rx queue {
   intr index:1
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:33
   next2Use in ring1:0
   next2Write:1569
}

# vsish -e get /net/portsets/DvsPortset-0/ports/50331669/vmxnet3/rxqueues/7/status
status of a vmxnet3 vNIC rx queue {
   intr index:7
   stopped:0
   error code:0
   ring #1 size:4096 <-- if you use ethtool -G eth0 rx 4096 inside the VM it updates ALL queues
   ring #2 size:256
   data ring size:0
   next2Use in ring0:1923
   next2Use in ring1:0
   next2Write:3458
}
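As the notes above hint, the ring size can also be inspected and adjusted from inside the guest with ethtool. A sketch of corroborating the same picture from the guest side (interface name and exact counter names depend on the kernel and vmxnet3 driver version):

# ethtool -g eth0                    <-- current and maximum RX ring sizes for the vNIC
# ethtool -G eth0 rx 4096            <-- bump the RX ring (updates all queues on vmxnet3, per the note above)
# ethtool -S eth0 | grep -i drop     <-- per-queue drop counters as seen by the guest driver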

So, that is where we are. We see the problem. Now, to fix it - that might be a separate post altogether.


Some additional knowledgebase sources of information on troubleshooting in VMWare environments:

  • MTU Problem

            https://kb.vmware.com/s/article/75213

  • Ring Buffer Problem

            https://kb.vmware.com/s/article/2039495

            https://vswitchzero.com/2017/09/26/vmxnet3-rx-ring-buffer-exhaustion-and-packet-loss/
