Thursday, April 26, 2018

SDN Solutions


Note to self: Look into Aryaka Networks
I am told they have a very robust and mature orchestration solution.

Look into Thousand Eyes
Not sure what this is all about.

TCPKali Part II

I've spent about a week testing with this tcpkali tool now.

This is the test case I initially put together: testing connections between two CentOS 7 VMs through an Ubuntu VM acting as an IP forwarder/router.


The tool spawns worker threads that test connection establishment and bandwidth on each of the connections, reporting results both per thread and in aggregate.

One thing I ran into with Lighttpd is that at high volume levels I kept getting invalid request headers. I wasn't sure whether tcpkali was mangling the headers or Lighttpd was at fault.

So, at the suggestion of one of the other engineers here, I changed the architecture a bit. Instead of using Lighttpd as the web server, I put tcpkali on both sides (left and right) - in client and server mode respectively.


This worked much better than using Lighttpd as the web server. You can send messages, and the server can run in active or silent mode, either returning responses or discarding them.

tcpkali has some great features. One HUGE feature is the ability to specify a send rate, in requests per second, with the -r option. The tool attempts to ramp up to the specified number of connections, and if it cannot, it bails without reporting throughput measurements. I did notice that with -r it sends data DURING the ramp-up period, in addition to after it. The tool also lets you control the connect rate, so that if you are establishing 100K connections you can speed up that process by doing 10K or 20K a clip rather than the smaller default limits.
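To make this concrete, here is roughly what the server/client pair of invocations looks like. The flag spellings are from my reading of the tcpkali docs, and the host name and numbers are placeholders - double-check against your build:

```shell
# Server side: listen and discard whatever it receives ("silent" mode)
tcpkali --listen-port 7000 --listen-mode silent

# Client side: 20,000 concurrent connections, ramped at 200 conns/sec,
# each sending the message 50 times per second, for 60 seconds
tcpkali -c 20000 --connect-rate 200 -r 50 \
        -m "PING\n" -T 60s server-host:7000
```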

What I discovered in the testing I was doing, was that the connection rate had a profound effect on the number of connections you could establish. The default connection rate is 100. And this works fine for up to about 10,000 connections. But if you try to go to higher than that, the rate starts to inhibit your ability to scale the connections and you need to tweak it higher - just not too much higher.
The sweet spot I discovered was a 100 : 1 ratio of connections to connect rate. So 10,000 connections would use a rate of 100 connections per second, 20,000 would use 200 per second, and so forth.
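That rule of thumb is easy to capture in a tiny helper (just a sketch - the function name is my own, and the ratio is the one my testing suggested):

```python
# Sketch of the connections-to-connect-rate rule of thumb:
# roughly 100 connections per 1 connection/sec of ramp rate,
# floored at tcpkali's default connect rate of 100.

def suggested_connect_rate(connections, ratio=100, floor=100):
    """Return a --connect-rate value for a given -c connection count."""
    return max(floor, connections // ratio)

print(suggested_connect_rate(10_000))   # 100 (the default is fine here)
print(suggested_connect_rate(20_000))   # 200
print(suggested_connect_rate(100_000))  # 1000
```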

Thursday, April 12, 2018

TCPKali

Today the QA team told me they had selected a tool called tcpkali (though they weren't wedded to it). They wanted me to download it, compile it, and try it out; they thought they would use it to test connection scaling from clients running our product.

So I did this yesterday evening. It requires a number of kernel parameters to be changed before you can actually scale, which took me down a deep road of TCP state transitions and TIME_WAIT. Rather than just blindly change kernel parameters, I read a couple of good articles on this:

http://developerweb.net/viewtopic.php?id=2941

http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

Okay...now I know why I am changing the reuse and recycle parms on TIME WAIT.
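For the record, these are the sorts of settings involved. The values below are starting points for a CentOS 7-era kernel, not tuned recommendations; note that tcp_tw_recycle is dangerous behind NAT and was removed entirely in newer kernels (4.12+):

```
# /etc/sysctl.conf additions for connection-scaling tests
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65535
fs.file-max = 1048576
```

Apply with sysctl -p, and remember the per-process open-file limit (ulimit -n) also caps how many sockets the client can hold.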

I actually tested this by setting up a Lighttpd web server and using the man page examples. I assumed Lighttpd would be closing the socket connections and going into the TIME_WAIT state, but in my testing I saw that the client (tcpkali) was actually racking up sockets in that state - until I set those kernel parms, which kept the count down to a lower number.
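The underlying rule is that the side which closes the connection first is the one that owns TIME_WAIT. A small sketch that demonstrates this on Linux (it assumes /proc/net/tcp is available, where state "06" is TIME_WAIT):

```python
import socket
import time

def port_in_time_wait(port):
    """Scan /proc/net/tcp for a local port sitting in TIME_WAIT (state 06)."""
    with open("/proc/net/tcp") as f:
        for line in f.readlines()[1:]:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            if local_port == port and fields[3] == "06":
                return True
    return False

# Set up a loopback server and one client connection
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()
client_port = cli.getsockname()[1]

cli.close()      # client closes first -> the client side owns TIME_WAIT
conn.close()
time.sleep(0.2)  # let the close handshake finish

print(port_in_time_wait(client_port))  # True: the client holds TIME_WAIT
```

This matches what I saw: the tcpkali client, which initiates the closes, is the side that racks up TIME_WAIT sockets, not the server.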

Libnet and Libpcap

In the Jon Erickson book, he discusses the differences between libnet and libpcap.

Libnet is used to send packets (it doesn't receive).

Libpcap is used to filter (receive) packets - it doesn't send.

So you need both modes to have, well, "a full duplex solution".

I downloaded and compiled a bunch of libnet code examples so I can fiddle around and send packets under different scenarios. It's fairly easy to use, I think. All in C.

Libpcap is a library that lets you initialize a listener loop: you pass in a BPF (Berkeley Packet Filter) expression and a callback function, and packets matching the filter criteria are fed into the callback.

I had issues running the libpcap examples on VirtualBox virtual machines that had a bridged interface to the host. I need to re-run the code from the libpcap tutorial on a dedicated Linux box, or maybe change the adapter type on the VirtualBox VMs.

Security and Hacking - Part I

I don't usually write much about Security and Hacking, but I will need to do a little bit of that because that is what I have been working on lately.

I went to the RSA show a couple of years ago, and that bootstrapped my involvement in security. The Dispersive DVN, after all, is all about security. We have had a number of people come in and pen test the networks, and I have read those reports. Recently, once I finished the Orchestration work, Research asked me to bolster my skills in this area and do some internal pen testing of our network. This is a big undertaking, to say the least.

I started with Hacking: The Art of Exploitation (2nd Edition), by Jon Erickson. This book is not for script kiddies. It uses practical assembler and C examples on a (dated) version of Ubuntu that you compile and run as you work through the book. I have gone through the entire book, page by page, and learned some very interesting things. Where I kind of got lost was in the shellcode sections - which cover essentially the one key skill that separates the port scanners and tire kickers from the people who know how to actually exploit and break into networks and systems. I will need to go back through those sections iteratively to actually master the skills presented in this book.

I've built a "Pen Testing" station on an Ubuntu VM; this VM is essentially my "attack plane" for the OpenStack network. It sits outside the OpenStack networks but can route to all of the networks inside OpenStack via the OpenStack router.

So far, I have run a series of half-open port scans and documented all of the ports I've been finding open on various network elements.
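For reference, a half-open scan is what nmap calls a SYN scan: it sends a SYN, records the SYN/ACK, and never completes the handshake. The invocation looks roughly like this (the subnet and output file name are placeholders, not our real ranges):

```shell
# SYN ("half-open") scan of a tenant network; requires root.
# -oA writes the results in all three nmap output formats.
sudo nmap -sS -p 1-65535 -oA scan-results 10.0.0.0/24
```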

It appears that someone in a Load Testing group is trying to lasso me out of research and "make" me join this load testing team, which will make this an extracurricular effort if they succeed in doing this.

QoS with OpenBaton and OpenStack - Research Findings

Earlier on, I tried to get the Network Slicing module of OpenBaton working.

Network Slicing is, essentially, QoS. There are some extremely deep papers written on Network Slicing from a conceptual perspective (I have read the ones from Fraunhofer Fokus think tank in Berlin). For brevity I won't go into that here (maybe I will follow up when I have more time).

Initially, I had to change a lot of OpenBaton's code to get it to even compile and run. But it didn't work. After reading some papers, I decided that maybe the reason it wasn't working was that I was using LinuxBridge drivers rather than OpenVSwitch (my theory was that OVS flows might be necessary for calculating QoS metrics on the fly).

Having gotten OpenVSwitch to work with my OpenStack, I once again attempted to get OpenBaton's Network Slicing to work. Once again I had to change the code (it did not support SSL), but I was able to get it working (or should I say running) off the develop git branch. I tested the QoS with a "Bronze" bandwidth_limit policy on one of my network elements (a VNFM), and it did not work: I set up two iPerf VMs and blew right past the bandwidth limit.

This led me to go back to OpenStack and examine QoS there, an "inside out" troubleshooting approach.

My OpenStack (Newton release - updated) supports DSCP and bandwidth limit policies. It does not support minimum_bandwidth, which the Open Baton NSE (Network Slicing Engine) examples apply. With that confirmed, I went ahead and used Neutron to apply a 3000 kbps (300 kbps burst) bandwidth limit policy rule, not only on the tenant network of a running VM (192.168.178.0/24) but also on the specific port the VM was running on (192.168.178.12). I re-tested the QoS with iPerf, and lo and behold, I saw the traffic being throttled. I will mention that the throttling seemed to work smoother with TCP traffic on iPerf than with UDP traffic. UDP traffic on iPerf is notably slower than TCP anyway, because of things like buffering and windowing and such.
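The Neutron CLI sequence I used was roughly the following (Newton-era neutron client; the policy name and the port UUID are placeholders):

```shell
# Create a QoS policy and attach a bandwidth-limit rule to it
neutron qos-policy-create bw-limit-3000
neutron qos-bandwidth-limit-rule-create bw-limit-3000 \
    --max-kbps 3000 --max-burst-kbps 300

# Apply the policy directly to the VM's port
neutron port-update <port-uuid> --qos-policy bw-limit-3000
```

Then re-test between the two VMs with iPerf (server on one, client on the other) and watch the reported bandwidth cap out near the limit.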

With this, I went back and re-tested the OpenBaton Network Slicing Engine, and I realized that the OpenBaton NSE did not seem to be communicating with the OpenStack API to apply the QoS rule to the port. I mentioned this on Gitter to the guys at OpenBaton.

I doubt right now I will have time to go in and debug the source code to figure this out. They have not responded to my inquiry on this. There seems to be only one guy over in Europe that discusses QoS on that forum.

I would like to go back to the OpenStack and re-test the DSCP aspect of QoS. OpenBaton does not support DSCP so this would be an isolated exercise.

I will summarize by saying that there are NUMEROUS ways to "skin the cat" (why does this proverb exist?) with respect to QoS. A guy at work is using FirewallD (iptables) to put rate limiting in on the iptables direct rules as a means of governing traffic.  OpenVSwitch also has the ability to do QoS. So does OpenStack (which may use OpenVSwitch under the hood if you are using OVS drivers I guess). With all of these points in a network that MIGHT be using QoS, it makes me wonder how anything can actually work at the end of the day.
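As an illustration of that iptables approach, a FirewallD direct rule can use the hashlimit match to drop traffic above a packet-rate threshold per source IP. This is my approximation of the idea, not my colleague's actual rule, and the numbers are made up:

```shell
# Drop forwarded TCP traffic from any source exceeding ~1000 packets/sec
firewall-cmd --direct --add-rule ipv4 filter FORWARD 0 \
    -p tcp -m hashlimit --hashlimit-above 1000/sec \
    --hashlimit-mode srcip --hashlimit-name tcp-throttle -j DROP
```

Note this governs packet rate, not bandwidth in kbps, so it is a cruder lever than the Neutron/OVS policies.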

SLAs using Zabbix in a VMware Environment

Zabbix 7 introduced some better support for SLAs. It also has better support for VMware. VMware, of course now owned by Broadcom, has prio...