
I recently stumbled on an interesting TCP performance issue while running tests that compared network performance against loopback performance. In my case the network performance exceeded the loopback performance (1 Gig network, same subnet). In the case I am dealing with, latencies are crucial, so TCP_NODELAY is enabled. The best theory we have come up with is that TCP congestion control is holding up packets. We did some packet analysis and can definitely see that packets are being held, but the reason is not obvious. Now the questions...

1) In what cases, and why, would communicating over loopback be slower than over the network?

2) When sending as fast as possible, why does toggling TCP_NODELAY have so much more of an impact on maximum throughput over loopback than over the network?

3) How can we detect and analyze TCP congestion control as a potential explanation for the poor performance?

4) Does anyone have any other theories as to the reason for this phenomenon? If yes, any method to prove the theory?

Here is some sample data generated by a simple point-to-point C++ app (a simplified sketch of the sender loop follows the table):

Transport     Message Size (bytes)  TCP NoDelay   Send Buffer (bytes)   Sender Host   Receiver Host   Throughput (bytes/sec)  Message Rate (msgs/sec)
TCP           128                   On            16777216              HostA         HostB           118085994                922546
TCP           128                   Off           16777216              HostA         HostB           118072006                922437
TCP           128                   On                4096              HostA         HostB            11097417                 86698
TCP           128                   Off               4096              HostA         HostB            62441935                487827
TCP           128                   On            16777216              HostA         HostA            20606417                160987
TCP           128                   Off           16777216              HostA         HostA           239580949               1871726
TCP           128                   On                4096              HostA         HostA            18053364                141041
TCP           128                   Off               4096              HostA         HostA           214148304               1673033
UnixStream    128                   -             16777216              HostA         HostA            89215454                696995
UnixDatagram  128                   -             16777216              HostA         HostA            41275468                322464
NamedPipe     128                   -             -                     HostA         HostA            73488749                574130
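For reference, the sender side of the test is essentially along the lines of the sketch below. This is a simplified illustration, not the actual test code: the port and address are made up, and error handling is mostly omitted.

    // Simplified sketch of the sender: set TCP_NODELAY and SO_SNDBUF, connect,
    // then stream fixed-size messages as fast as possible.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return 1;

        int nodelay = 1;                 // 1 = TCP_NODELAY on, 0 = off (the "TCP NoDelay" column)
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));

        int sndbuf = 16 * 1024 * 1024;   // 16777216 or 4096 (the "Send Buffer" column)
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);                       // hypothetical port
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);   // loopback, or HostB's address
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) return 1;

        char msg[128] = {};              // 128-byte message, as in the table
        for (;;) {
            ssize_t n = send(fd, msg, sizeof(msg), 0);
            if (n != static_cast<ssize_t>(sizeof(msg))) break;   // stop on error/short send
        }
        close(fd);
        return 0;
    }

In the sketch, the receiver would simply drain the socket and count bytes to compute throughput and message rate.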

Here are a few more pieces of useful information:

  • I only see this issue with small messages
  • HostA and HostB both have the same hardware kit (Xeon X5550 @ 2.67 GHz, 32 cores total, 128 GB memory, 1 Gig NICs)
  • OS is RHEL 5.4 (kernel 2.6.18-164.2.1.el5)

Thank You

rns
  • If latencies are crucial, I'd switch to UNIX domain sockets (very similar to TCP sockets) or pipes (faster, but more complicated, you need two pipes for a bidirectional connection). They carry less baggage than TCP sockets and offer lower latencies. – Peter G. Apr 29 '11 at 13:04
  • These might not be pertinent questions, but I'm curious. What are the actual results that you are seeing in the two scenarios? What is the throughput and time? Also, is the test sending mostly in one direction, or is it more of an echo style test where the same amount of data is sent in a response? – Mark Wilkins Apr 29 '11 at 13:13
  • @Mark I added the results of our testing to the main post. I also added a couple of other pertinent details. The tests are sending in one direction. – rns Apr 30 '11 at 22:10
  • Thanks for posting the information; it is certainly interesting, but it doesn't really give me any useful thoughts. The difference is indeed quite drastic (10% of the throughput when TCP_NODELAY is enabled). It almost has the "feel" that every send request is blocking until it gets an ACK. But that doesn't make a lot of sense. – Mark Wilkins May 02 '11 at 17:59

3 Answers


1) In what cases, and why, would communicating over loopback be slower than over the network?

Loopback puts the packet setup and TCP checksum calculation for both tx and rx on the same machine, so it has to do twice as much processing, whereas with two machines the tx/rx work is split between them. This can have a negative impact on loopback.

2) When sending as fast as possible, why does toggling TCP_NODELAY have so much more of an impact on maximum throughput over loopback than over the network?

Not sure how you came to this conclusion, but loopback and the network are implemented very differently, and if you push them to the limit you will hit different issues. The loopback interface (as mentioned in the answer to 1) puts both tx and rx processing overhead on the same machine. NICs, on the other hand, have a number of limits on how many outstanding packets they can hold in their circular buffers and so on, which creates completely different bottlenecks (and this varies greatly from chip to chip, and even with the switch between them).

3) How can we detect and analyze TCP congestion control as a potential explanation for the poor performance?

Congestion control only kicks in if there is packet loss. Are you seeing packet loss? If not, you are probably hitting a limit imposed by the TCP window size combined with the network latency.
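If you want to check for loss programmatically on Linux, one option is the TCP_INFO socket option. The sketch below is an illustration, not something from the original test; the exact fields available in struct tcp_info depend on your kernel and glibc headers.

    // Sketch: ask the kernel for per-connection TCP statistics on Linux.
    // A non-zero retransmission count is the usual sign that loss occurred and
    // congestion control backed off; tcpi_snd_cwnd is the congestion window
    // (in segments) and tcpi_rtt the smoothed RTT in microseconds.
    #include <netinet/in.h>
    #include <netinet/tcp.h>   // struct tcp_info, TCP_INFO
    #include <sys/socket.h>
    #include <cstdio>

    void dump_tcp_info(int fd) {
        tcp_info info = {};
        socklen_t len = sizeof(info);
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
            std::printf("total_retrans=%u cwnd=%u rtt=%u us\n",
                        info.tcpi_total_retrans,
                        info.tcpi_snd_cwnd,
                        info.tcpi_rtt);
        }
    }

You can also simply watch the TCP retransmission counters in netstat -s (or a packet capture) while the test runs; if they don't move, congestion control is unlikely to be the culprit.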

4) Does anyone have any other theories as to the reason for this phenomenon? If yes, any method to prove the theory?

I don't understand the phenomenon you refer to here. All I see in your table is that you have some sockets with a large send buffer - this can be perfectly legitimate. On a fast machine, your application will certainly be capable of generating more data than the network can pump out, so I'm not sure what you're classifying as a problem here.

One final note: small messages create a much bigger performance hit on your network for various reasons, such as:

  • there is a fixed per-packet overhead (MAC + IP + TCP headers), and the smaller the payload, the larger that overhead is as a fraction of what you send.
  • many NIC limitations are relative to the number of outstanding packets, which means you'll hit NIC bottlenecks with much less data when using smaller packets.
  • the network itself has per-packet overhead, so the maximum amount of data you can pump through the network again depends on the packet size (some rough numbers below).
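As a rough back-of-envelope illustration of the first and third points (assuming the worst case where every 128-byte message goes out in its own segment, with no coalescing): each segment carries about 20 bytes of IP header, 20 bytes of TCP header and roughly 38 bytes of Ethernet framing (preamble, header, FCS, inter-frame gap), so about 78 bytes of overhead per 128 bytes of payload, meaning only around 62% of the raw bit rate is payload. If the messages get coalesced into full ~1460-byte segments, the same overhead is amortized and payload efficiency rises to roughly 95%.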
csd

This is the same issue I faced as well. When transferring 2 MB of data between two components running on the same RHEL 6 machine, it took 7 seconds to complete. With larger transfers the time was unacceptable: it took 1 minute to transfer 10 MB of data.

Then I tried with TCP_NODELAY disabled, and that solved the problem.

This does not happen when the two components are on two different machines.

Gayanath

1 or 2) I'm not sure why you're bothering to use loopback at all; I personally don't know how closely it mimics a real interface or how valid a test it is. I know that Microsoft disables Nagle for the loopback interface (if you care). Take a look at this link; there's a discussion about it there.

3) I would look closely at the first few packets in both cases and see whether you're getting a severe delay in the first five packets. See here

stackmate
  • I can't see anything in 'this link' that says MS disables Nagle for the loopback interface. The thread appears to be about what happens when the *user* does that. He asks 'why is disabling Nagle's algorithm on loopback so much worse', which wouldn't be the case if it were already off. Can you clarify? – user207421 Apr 30 '11 at 12:23
  • Our reason for using loopback is that we thought we would get better performance, but we have now stumbled on this phenomenon and are trying to pin down a cause. Also, there are real-world use cases for using loopback (because better performance is expected and it cuts down on hardware costs). – rns Apr 30 '11 at 22:12
  • That link wasn't related to Microsoft; it related to the question, and maybe you could deduce something from the thread. I didn't think it held the answer, but it might discuss relevant ideas. – stackmate May 02 '11 at 17:42
  • @user207421: FYI, Microsoft's documentation [does say that Nagle is disabled automatically on the loopback interface](http://technet.microsoft.com/en-us/library/cc940037.aspx) (search for "loopback" at link): "The Nagle algorithm is not applied to loopback TCP connections for performance reasons." – ShadowRanger Oct 05 '18 at 17:30