
At the end of the day I think we will end up with RDMA being the norm. Additional offloads can only go so far and are going to require firmware anyway, so why not just cut to the chase.

Last time I checked, Chelsio backed iWARP, which has pretty much been defeated in the RDMA wars by RoCE, since iWARP had terrible performance because it has to run over TCP. Intel has even removed it from their new cards and pushed DPDK as a replacement. I'm also not surprised Chelsio has added additional offloads to compensate for their loss. That being said, RoCE does depend on that Converged Ethernet (CE) part, which is still not mainstream in many datacenters.

I get that people will be disappointed to see their previously open-source drivers replaced with proprietary firmware, but as FPGAs take off I think you may see VHDL code going open source on new fully programmable cards.

I for one welcome our new hardware-based overlords, but perhaps I'm being far too optimistic.



But why?

We are fairly confident we can make BSD pump several hundred Gbps doing real-world long-haul TCP for content serving in the next couple of years on something like Naples or POWER9.

At the other end, Isilon converted from InfiniBand to OS TCP for its latest product: https://www.nextplatform.com/2016/05/06/emc-shoots-explosive.... That is pretty amazing because of the low-latency timing and incast requirements.

To your point, Intel's Altera acquisition may eventually bear fruit, but I'm not holding my breath and don't really know how to reason about it until an offload/accelerator ecosystem is built up.


Now that I've reviewed the Chelsio doc more carefully, I've noticed that DDP is actually part of iWARP, so the TCP offload you mentioned is just one part of iWARP that happens to be transparent to user space, which is quite interesting. That being said, iWARP has lost out to RoCE mainly because of latency, not throughput. I guess I'm just more biased towards lower latency (at the same bandwidth) because it is empirically better, even though for most applications 10us vs 1us is negligible. That, and because I work in low-latency trading.


It's kind of the other way around: TCP is done in silicon and firmware and passed up to the OS via a scatter-gather engine, which is the core architectural feature. So iWARP lends itself to layering on the TOE, as do other higher-level protocols like iSCSI and FCoE. DDP requires intertwining with the OS networking stack and VM, but the card is able to DMA the data from the offloaded TCP stream right into the application's address-space sockbuf.

I'm out of my league on extreme low-latency stuff, but take a look at http://www.chelsio.com/wp-content/uploads/resources/T6-WDLat.... For comparison's sake, do you know the end-to-end latency of Mellanox RoCE?


Yeah, my rather old ConnectX-3 Pro 40GbE card is crushing all these numbers for the given message sizes of 16 to 1024 bytes.

  Size (bytes):       16    32    64    128   256   512   1024
  ib_send_lat (us):   0.77  0.77  0.80  0.86  1.17  1.31  1.58
  ib_write_lat (us):  0.73  0.75  0.76  0.85  1.12  1.29  1.57
  ib_read_lat (us):   1.38  1.38  1.39  1.42  1.50  1.65  1.91

However it's really difficult to judge these things without a detailed description of the test setup. This is over physical loopback on a single machine and even things like cable length can skew things at this level.


Can you elaborate on that? I've tried to find research or demos where people are doing 100+ Gbps of useful work in software and haven't seen it. POWER9 will be here soon, hopefully, and there's promise, but that doesn't seem like a given.


See drewg's comment. He is being modest in that they are also doing CPU encryption at 100G, so every single stream is different, which doubles the usage of memory bandwidth.

I will get to 100G with Skylake, mainly because we have to rework our storage BoM rather than because of CPU improvements. Intel's focus has been off, and they've somewhat misread where the market was going, but even today you have 40 PCIe 3.0 lanes (~39 GB/s), 67 GB/s of DDR4 memory bandwidth, and typically more than enough cores and threads to do whatever you want in a single E5 Xeon socket. Computers are _really_ fast, software is slow :)
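The speeds-and-feeds claim is easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch, using the ~39 GB/s and 67 GB/s figures from above; the "encryption doubles memory traffic" factor is a simplifying assumption, not a measured number:

```python
# Back-of-the-envelope check of the single-socket 100G argument.
# The PCIe and DRAM figures come from the comment above; the memory
# traffic model is a rough assumption for illustration only.

wire_gbps = 100            # target line rate, gigabits per second
wire_gbs = wire_gbps / 8   # 12.5 GB/s of payload on the wire

pcie_gbs = 39              # 40 PCIe 3.0 lanes, per the comment
mem_gbs = 67               # single-socket DDR4 bandwidth, per the comment

# A plain sendfile path roughly reads each byte once and writes it once;
# per-stream CPU encryption adds another read+write pass, doubling traffic.
mem_needed_gbs = wire_gbs * 2 * 2

print(f"PCIe headroom: {pcie_gbs / wire_gbs:.1f}x the wire rate")
print(f"memory: need ~{mem_needed_gbs:.0f} GB/s of {mem_gbs} GB/s available")
```

Even with the encryption penalty, the hardware budget clears 100G with room to spare, which is the "computers are fast, software is slow" point.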

So right out of the gate, you have plenty of speeds and feeds to get stuff off disk and out the wire. That's exactly my workload: pulling data off storage into the VFS while kqueue manages a pool of connected sockets, and when they are ready for more data, it goes out right from the page cache with sendfile. Netflix contributed some amazing work that makes FreeBSD particularly optimized for this workload.
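The kqueue-plus-sendfile serving loop can be sketched at toy scale in Python: `selectors` picks kqueue on the BSDs (epoll on Linux), and `os.sendfile()` pushes file bytes to a ready socket straight from the page cache with no userspace copy. All names here are illustrative; this is not Netflix's or FreeBSD's actual code:

```python
import os
import socket
import selectors
import tempfile

# Toy sketch of the event-driven sendfile path described above.
def serve_file(path, conn):
    """Send the whole file over conn via sendfile, driven by writability."""
    sel = selectors.DefaultSelector()   # kqueue on BSD, epoll on Linux
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_WRITE)
    offset, size = 0, os.path.getsize(path)
    with open(path, "rb") as f:
        while offset < size:
            sel.select()  # block until the socket can accept more data
            # Zero-copy: kernel moves bytes page cache -> socket directly.
            offset += os.sendfile(conn.fileno(), f.fileno(),
                                  offset, size - offset)
    sel.close()
    return offset

# Demo over a socketpair with a small temp file.
payload = b"straight from the page cache\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(payload)
a, b = socket.socketpair()
sent = serve_file(tf.name, a)
a.close()
received = b""
while chunk := b.recv(65536):
    received += chunk
b.close()
os.unlink(tf.name)
```

The real thing juggles thousands of sockets per kqueue and never blocks, but the shape of the loop (wait for writability, then sendfile from the cache) is the same.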

In general, the trick is to do less, batch more, and try not to copy things around. For example, with DPDK or netmap packet forwarding, clear an entire soft ring at a time instead of one packet at a time. Using netmap, you can change ownership of the data by pointer swapping to move it from the rx ring to the tx ring. The ACM Queue article on netmap is particularly good reading. Basically, pass by reference, but by understanding the memory layout of the system.
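The rx-to-tx pointer swap can be modeled with a toy ring in a few lines. This is pure illustration of the idea, not the netmap API itself (in real netmap you swap `netmap_slot.buf_idx` values in C and flag both slots with `NS_BUF_CHANGED`):

```python
# Toy model of netmap-style zero-copy forwarding: each ring slot owns an
# index into a shared buffer pool, and "moving" a packet from rx to tx is
# just swapping buffer indices -- no payload bytes are copied.

buffers = [bytearray(2048) for _ in range(8)]   # shared packet buffer pool

class Slot:
    def __init__(self, buf_idx, length=0):
        self.buf_idx = buf_idx   # which pool buffer this slot owns
        self.len = length

def forward(rx, tx, length):
    """Hand the rx buffer to the tx ring by swapping ownership."""
    rx.buf_idx, tx.buf_idx = tx.buf_idx, rx.buf_idx
    tx.len = length

# A packet "arrives" in the buffer owned by rx slot 0...
rx_slot, tx_slot = Slot(buf_idx=0), Slot(buf_idx=4)
buffers[rx_slot.buf_idx][:5] = b"hello"

forward(rx_slot, tx_slot, 5)

# ...and is now transmittable from the tx slot, copy-free; the rx slot
# picked up the tx slot's old buffer for the next incoming packet.
assert bytes(buffers[tx_slot.buf_idx][:5]) == b"hello"
```

The payload never moves; only two integers change hands, which is why this scales to line rate.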

aio and sendfile kind of suck on Linux. epoll kind of sucks. Linux hugepages really suck, and the Linux VM seems to be biased toward massive concurrency or something I don't really understand. None of these are monumental technical problems, but there is a malignant culture at these hyperscalers, because just being a bandwagon fan doesn't make you a winning team. Linux users tend to trust manufacturers to do all the device driver development correctly, and vendors like Red Hat to drive general forward progress. How many patches does Amazon have in Xen? In Linux?

At the BSD companies I've mentioned, a few dozen people across all of them have pulled this stuff off, and it's all there in the base system. We rip apart vendor drivers or entire subsystems when that's the prescription. We're pretty happy to share details and help others succeed at conferences or even by partnering up team to team across companies.


Thanks for your response. I'm very familiar with DPDK and am looking forward to the changes Skylake brings as well.




