
I wrote a user-mode client-server application based on Berkeley sockets that interacts over a private network.
The situation is definitely strange. Occasionally the connection becomes very slow under circumstances I can't pin down. Normally the TCP data exchange in my case carries about 10-25 KB of payload per segment, but sometimes it drops to roughly 200-500 bytes per segment.

After some troubleshooting, I realized that this problem is not reproducible with other network services, so it looks like my service is to blame. But I can't figure out what's wrong. It worked well on the 3.10 Linux kernel, but shows this strange behavior on 4.4. Could some internal kernel change have caused such a problem?

I tried to play with Linux sysctl settings:

net.ipv4.tcp_congestion_control
net.ipv4.tcp_sack
net.ipv4.route.flush

but that did not help.

It seems that the problem appears on the listening socket's side. In tcpdump the TCP window size is OK during the handshake, but after the first incoming data packet the window size advertised by the listener shrinks.

UPD
Here is my server-side code snippet:

 /* Declarations are omitted from this snippet: serv_fd is an int,
    server is a struct sockaddr_in, and buflen is an int equal to 65000
    (see the comments below). */

 serv_fd = socket(AF_INET, SOCK_STREAM, 0);
 if (serv_fd == -1) {
      perror("socket");
      return;
 }

 server.sin_family = AF_INET;
 server.sin_port = htons(LISTEN_PORT);
 server.sin_addr.s_addr = htonl(INADDR_ANY);

 #ifdef SET_BUF
 /* conditionally set explicit socket buffer sizes before bind()/listen() */
 if (setsockopt(serv_fd, SOL_SOCKET, SO_RCVBUF, &buflen, sizeof(int)) == -1) {
      perror("setsockopt");
      return;
 }
 if (setsockopt(serv_fd, SOL_SOCKET, SO_SNDBUF, &buflen, sizeof(int)) == -1) {
      perror("setsockopt");
      return;
 }
 #endif // SET_BUF

 if (bind(serv_fd, (struct sockaddr *) &server, sizeof(server)) == -1) {
      perror("bind");
      return;
 }

 if (listen(serv_fd, 3)) {
      perror("listen");
      return;
 }

 printf("Server is listening on %u\n", LISTEN_PORT);

Could someone shed some light on my problem? I would be very grateful!
Can it be related to some recent Linux kernel modifications? Do I need to tune some Linux kernel settings or check some user-mode settings (e.g. socket options or whatever)?

P.S. The problem is unstable.

UPD:

tcpdump's output:

IP 10.0.0.34.31334 > 10.0.0.99.12345: Flags [S], seq 426261790, win 43690, options [mss 65495,sackOK,TS val 799180610 ecr 0,nop,wscale 7], length 0
IP 10.0.0.99.12345 > 10.0.0.34.31334: Flags [S.], seq 803872704, ack 426261791, win 65483, options [mss 65495,sackOK,TS val 799180567 ecr 799180610,nop,wscale 0], length 0
IP 10.0.0.34.31334 > 10.0.0.99.12345: Flags [.], ack 1, win 342, options [nop,nop,TS val 799180610 ecr 799180567], length 0
IP 10.0.0.34.31334 > 10.0.0.99.12345: Flags [P.], seq 1:1301, ack 1, win 342, options [nop,nop,TS val 799180610 ecr 799180567], length 1300
IP 10.0.0.34.31334 > 10.0.0.99.12345: Flags [P.], seq 1301:1804, ack 1, win 342, options [nop,nop,TS val 799181412 ecr 799180610], length 503
IP 10.0.0.99.12345 > 10.0.0.34.31334: Flags [.], ack 1804, win 512, options [nop,nop,TS val 799181412 ecr 799181412], length 0

10.0.0.34.31334 is the client, 10.0.0.99.12345 is the server. Pay attention to the unexpected `win 512` in the last line.

UPD2: I saw several messages about SYN-cookies in dmesg like:

possible SYN flooding on port 12345. Sending cookies.

But they don't correlate closely in time with the slow transmissions.

  • Quite a strange situation. How can you be sure that the problem did not appear on the 3.10 kernel if *"The problem is unstable"*? Take a look at the `dmesg` output. Is there anything unusual at the moments of slow transmission? – red0ct Jan 16 '20 at 17:39
  • Is your 4.4 machine's network configuration correct? Strange problems occur if, for example, your IP is already used by some other machine. – jxh Jan 17 '20 at 18:26
  • @jxh Thank you for the good idea, but it seems that the network settings are OK. Still can't find any reason. – z0lupka Jan 18 '20 at 14:18
  • @red0ct The problem didn't appear on the 3.10 kernel for a very long time. – z0lupka Jan 18 '20 at 14:20
  • @red0ct It's hard to tell what exactly is unusual in dmesg since I don't know what is usual. There are some networking- and network-driver-related messages. – z0lupka Jan 18 '20 at 15:24
  • Please edit your question and add some clarifying info. What exactly do you see in `tcpdump`? What about `dmesg`? – red0ct Jan 18 '20 at 19:38
  • You say that the window size is OK on the handshake, but because syn cookies are being used, the actual initial SYN parameters are encoded into the sequence number and timestamp in the SYN ACK which you would have to decode to determine that they were "OK." I have heard allegations that wscale may no longer be handled properly when using syn cookies. I would suggest increasing the backlog in your listen() to something much larger than 3 and checking and increasing SOMAXCONN as necessary to allow your larger backlog to prevent syn cookies from being used. – Jim D. Jan 24 '20 at 19:40
  • Note that a Cloudflare blog entry [SYN packet handling in the wild](https://blog.cloudflare.com/syn-packet-handling-in-the-wild/) points out that "In kernels before 4.3 the SYN Queue length was counted differently." I haven't looked at it carefully, but I think your backlog of 3 got rounded up to 16 prior to 4.3. This also suggests increasing your backlog to at least 16 might get you performance similar to what you saw in 3.10. Change linked to by blog [here](https://github.com/torvalds/linux/commit/ef547f2ac16bd9d77a780a0e7c70857e69e8f23f#diff-56ecfd3cd70d57cde321f395f0d8d743L43). – Jim D. Jan 24 '20 at 20:00
  • You should also check if SYN cookies were enabled in your 3.10 deployment. If they weren't, the SYN would just be dropped and the client would retransmit. [Here](https://hustcat.github.io/tcp_syn_cookies_and_window_size/) is someone else observing a similar problem with window size and SYN cookies when timestamps are enabled. (If you can't read Chinese, load in Chrome, right click and select translate which will do a decent job). According to the Cloudflare blog they have timestamps disabled (which is where wscale is/was stored) and hence could not have seen this problem. – Jim D. Jan 24 '20 at 20:08
  • @JimD. *"I think your backlog of 3 got rounded up to 16 prior to 4.3"* - could you provide some clarification about rounding of backlog value in earlier kernels? – z0lupka Jan 24 '20 at 21:59
  • @z0lupka Look at [this change](https://github.com/torvalds/linux/commit/ef547f2ac16bd9d77a780a0e7c70857e69e8f23f#diff-56ecfd3cd70d57cde321f395f0d8d743L43) that eliminated `nr_table_entries` which used to have a minimum value of `8+1` rounded up to a power of 2 = `16`. I didn't actually try to follow the value of `backlog` from `listen()`. To really prove the SYN Queue length was `16` based on a `backlog` value of 3 would take a bit more investigation. – Jim D. Jan 24 '20 at 22:20
  • @z0lupka Somebody has done the work to follow this through [here](https://blog.dubbelboer.com/2012/04/09/syn-cookies.html) – Jim D. Jan 24 '20 at 22:57
  • @Mr.AF I don't deny your ideas, man! But it looks like you work with some kind of high-level "socket" wrappers that have almost nothing to do with the Berkeley sockets API and are far from TCP. And surprisingly, you recommend using HTTP instead of TCP, which is absolutely unclear. [OSI model](https://geek-university.com/uncategorized/data-encapsulation-in-the-osi-model/) – z0lupka Jan 25 '20 at 11:51
  • What is the value of `buflen`? What is the value after the first `setsockopt` returns? Why isn't it initialized in your code? – Myst Jan 26 '20 at 21:47
  • @Myst "Code snippet" implies that it is **snippet** - piece of code, i. e. not all code. `buflen` is 65000. `setsockopt()` returns with success. – z0lupka Jan 27 '20 at 08:00
  • Have you tried disabling the Nagle algorithm for TCP? Have a look at this: https://stackoverflow.com/questions/17842406/how-would-one-disable-nagles-algorithm-in-linux – Mihai Jan 28 '20 at 08:49
  • @Mihai I think it's not a good idea, because my client-server application exchanges data in huge portions, which becomes very slow with a 512-byte TCP window. How would disabling Nagle's algorithm help with that? – z0lupka Jan 28 '20 at 09:10

1 Answer


I'm not entirely sure this is exactly your case, but it looks similar. It seems to be a known problem.

Reasons

A combination of circumstances can lead to such Linux kernel behavior:

  • The specifics of how the kernel handles connections in the SYN-cookies context when the connection has a zero Window Scale (or the WS is modified in some other way).
  • A zero Window Scale that you provoked yourself with setsockopt() and SO_RCVBUF (see tcp_select_initial_window()); there is a small check sketched right after this list.
  • An extremely small backlog.
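
A quick way to see what the second point does in practice (a sketch only, reusing the `serv_fd` and `buflen` names from the snippet in the question and assuming `<stdio.h>` is included): ask the kernel what receive buffer it actually installed, right after your setsockopt() calls. Linux normally stores double the requested value and sizes the initial window from it, so if the result stays around 64 KB or below, tcp_select_initial_window() has no reason to choose a non-zero window scale for the SYN+ACK.

    /* Sketch: drop this right after the SO_RCVBUF/SO_SNDBUF calls. */
    int actual = 0;
    socklen_t optlen = sizeof(actual);
    if (getsockopt(serv_fd, SOL_SOCKET, SO_RCVBUF, &actual, &optlen) == -1)
        perror("getsockopt");
    else
        printf("SO_RCVBUF: requested %d, kernel uses %d\n", buflen, actual);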

Explanation

About "slow" transmission:
The Window Scaling option is calculated by both hosts during the [SYN - SYN+ACK] stage. Roughly speaking, Host A says "scale my advertised TCP window by 2^N during the future exchange" (SYN), then Host B says "scale my advertised TCP window by 2^M during the future exchange" (SYN+ACK); N and M may be the same. In a normal situation both coefficients are stored and then used for the whole data exchange.
But the TCP SYN-cookies technique means the server keeps no state for the [SYN - SYN+ACK] stage of the connection (some of the negotiated options, including WS, are lost after the SYN+ACK is sent). In that case the Linux kernel re-derives the WS value when the ACK arrives (once the ACK has arrived, a regular connection has to be created). That second calculation can differ from the first because setsockopt() does not affect it (for some objective reasons). So you end up in a situation where your server sends a zero Window Scale option in the SYN+ACK, then forgets about it, then re-creates the connection (when the ACK arrives) as if it had some default Window Scale (e.g. 7), and advertises a small window assuming the client will multiply it by 128. But the client has not forgotten that WS is 0 and takes the small window size literally, so it sends data in small portions, and that's how your "slow" connection appears.
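
To put numbers on it using your capture (the server-side scale of 7 here is an assumption about what the kernel falls back to; it is not visible in the trace):

    /* Toy illustration: the same advertised value is read differently
     * once the negotiated scale has been lost on the server side. */
    #include <stdio.h>

    int main(void) {
        unsigned advertised    = 512; /* the "win 512" from the last ACK       */
        unsigned server_wscale = 7;   /* assumed scale the kernel fell back to */
        unsigned client_wscale = 0;   /* scale the client saw in the SYN+ACK   */

        printf("server thinks it offers %u bytes\n", advertised << server_wscale); /* 65536 */
        printf("client actually uses    %u bytes\n", advertised << client_wscale); /* 512   */
        return 0;
    }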

About the SYN flood:
When you have such a small backlog, as few as 3 retransmitted SYNs can trigger SYN-cookies (i.e. they are enough to fill your backlog queue). BTW, do you see retransmissions in tcpdump?
From ip-sysctl.txt:

Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.

So if there are no SYN-flood attacks in your LAN, your server is seriously misconfigured. SYN-cookies should do their job only when a SYN-flood attack is actually present.


Solution

To sum up, there are a few things you can do to eliminate the problem:

  1. If there is a real SYN flood in your network, SYN-cookies partially solve that security issue. Under a real attack there is no time to think about slow connections; that is an emergency.
  2. If not, i.e. a few SYN retransmissions provoked the SYN-cookies:
    • thoughtfully increase the backlog to eliminate such conditions (see the sketch after this list);
    • don't call setsockopt() with SO_RCVBUF on the listening socket. It doesn't make much sense there, and without it you reduce the probability of the kernel computing different WS values in the scenario described above. You can still set SO_RCVBUF on the accepted socket if needed.
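
A minimal sketch of point 2 applied to the snippet from the question (it keeps the `serv_fd` and `buflen` names; `BACKLOG` of 128 is just an example value and is still capped by `net.core.somaxconn`):

    #define BACKLOG 128   /* example value; the kernel caps it at net.core.somaxconn */

    /* no SO_RCVBUF/SO_SNDBUF on the listening socket anymore */

    if (listen(serv_fd, BACKLOG) == -1) {
        perror("listen");
        return;
    }

    for (;;) {
        int conn_fd = accept(serv_fd, NULL, NULL);
        if (conn_fd == -1) {
            perror("accept");
            continue;
        }
        /* If a custom buffer size is really needed, set it here on the
         * accepted socket -- the window scale is already negotiated by now. */
        if (setsockopt(conn_fd, SOL_SOCKET, SO_RCVBUF, &buflen, sizeof(buflen)) == -1)
            perror("setsockopt");

        /* ... serve the connection, then close(conn_fd) ... */
    }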

Repro

I reproduced your problem with a simple client and server plus hping3 under approximately the same conditions. You can stuff the server's backlog queue with:

hping3 -c 3 -S -p 12345 --fast 10.0.0.99

then initiate a connection from the client: the connection will be opened in the so-called "SYN-cookies context", at least on the 4.4 kernel. You can also check this on the 3.10 kernel by increasing -c from 3 to a larger value until you successfully reproduce it.

  • OK, but what about the 3.10 kernel? It's my production code, so I must be aware of possible unwanted behavior. – z0lupka Jan 24 '20 at 21:57
  • And yes, there is no SYN flood in my LAN, but there are several SYN retransmissions. – z0lupka Jan 24 '20 at 22:40
  • @z0lupka You can simulate such SYN retransmissions with the [`hping3`](https://linux.die.net/man/8/hping3) utility and check the 3.10 kernel's behavior. Something like `hping3 -d 120 -S -p 12345 10.0.0.99 --fast`, then immediately initiate a connection from the client. – red0ct Jan 24 '20 at 22:51
  • Thanks a lot! Increasing the backlog helped. – z0lupka Jan 28 '20 at 08:45