
I have two servers listening on a TCP port behind a load balancer. The load balancer can detect if a TCP connection attempt from a client was unsuccessful and retry it against the second server without dropping that connection. I want to be able to bring either of these two servers down for maintenance without dropping a single client connection.

My servers use this code to process client requests:

ServerSocketFactory ssf = ...
ServerSocket serverSocket = ssf.createServerSocket(60000);
try {
    while (true) {
        Socket socket = serverSocket.accept();
        ...// Do the processing
    }
} catch (IOException e) {
    ...
}
...

My initial thought was to add a boolean that would be set on application shutdown and prevent new serverSocket.accept() calls while waiting for all existing connections to be processed and closed. However, new connections are being established even before the serverSocket.accept() call: if I put a breakpoint before that call, Wireshark shows the TCP handshake for new clients completing anyway. The problem is that as soon as I call serverSocket.close() at this point, all such client connections get dropped.

What I want to achieve is some way of telling ServerSocket to stop accepting new connections (i.e. only send RST for new connections or let them time out), so the load balancer can reroute them to the other server, but at the same time not drop any already established connections.
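Both observations above can be reproduced on loopback with a small program. This is a minimal sketch, not code from the question: the class and method names are hypothetical, and whether the client sees a reset or an EOF after the listener closes may vary by OS (Linux typically reports "Connection reset").

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BacklogDemo {
    /** Returns true if the client's connect() completes even though accept() is never called. */
    static boolean connectsWithoutAccept() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the backlog of 50 is only a hint to the OS
        try (ServerSocket server = new ServerSocket(0, 50, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {
            // No server.accept() anywhere: the kernel completed the three-way handshake on its own
            return client.isConnected();
        }
    }

    /** Returns true if a connection that was never accepted is torn down when the listener closes. */
    static boolean queuedConnectionDroppedOnClose() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        ServerSocket server = new ServerSocket(0, 50, loopback);
        try (Socket client = new Socket(loopback, server.getLocalPort())) {
            server.close(); // the connection is still waiting in the accept (backlog) queue
            client.setSoTimeout(2000);
            try {
                return client.getInputStream().read() == -1; // an EOF would also mean it is gone
            } catch (SocketTimeoutException e) {
                return false; // no reset and no EOF: the connection survived
            } catch (IOException e) {
                return true; // typically "Connection reset"
            }
        } finally {
            server.close(); // no-op if already closed; covers the case where connect() failed
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("connected without accept(): " + connectsWithoutAccept());
        System.out.println("dropped when listener closed: " + queuedConnectionDroppedOnClose());
    }
}
```

Nothing in the Java API lets the application intervene between the handshake and the `accept()` call, which is why a shutdown flag checked before `accept()` cannot prevent these connections from being established.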

Edit: I'm looking for some automated solution which wouldn't require me to change any load balancer or OS settings every time I want to update the application.

John29
  • I believe it's the function of your load balancer to detect which servers are available and forward new connections to them. I would not make your socket code more complex, as that's not its function. Look at your load balancer's health check and routing config. If you take one of the server IPs out of the load balancer's routing table, new connections will go the other way. – Minh Kieu Jun 09 '17 at 18:45
  • Yes, it's the function of the load balancer to detect available servers, and it does its job. It detects if the connection was not established (reset or timed out) and forwards that connection to another server. The problem is that my server keeps accepting new connections, even when I don't want it to, before I call `serverSocket.close()`. And as soon as I call it, all already established connections for which `serverSocket.accept()` wasn't called yet get dropped. The load balancer can't help me in this case because it sees these connections as established and assumes that the server can handle them. – John29 Jun 09 '17 at 18:51
  • It does depend on how your load balancer is configured. The load balancer (F5) I worked with has a health check config. We exposed a servlet endpoint for the LB to monitor. If we make this service unavailable, the LB will think the server is down and will not route new traffic there. – Minh Kieu Jun 09 '17 at 18:57
  • I also work with F5 and it does have health checks. I use the [Reselect Tries option and Inband monitor](https://support.f5.com/csp/article/K10640), so it always retries to connect to another server if a connection fails. But it still doesn't solve the above problem, because these connections only fail after they are already established. Using one of the "check every n seconds" health monitors instead of Inband is even worse, because there's always a time window between when I bring a server down and when F5 sends the next health check request and detects that it's down. – John29 Jun 09 '17 at 19:18
  • The other option is to use Apache graceful shutdown; however, I don't know if it's applicable to sockets. I know it works for HTTP connections. – Minh Kieu Jun 09 '17 at 20:18
  • This is quite an interesting topic. You should read http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. It seems there is no way to do what you want, because setting the application-level accept queue to zero is impossible in Java (and it would not make much sense anyway). You really should do what others suggested and implement another health check. – Sami Korhonen Jun 12 '17 at 20:27
  • @SamiKorhonen Thanks for the link. The F5 load balancer has [2 kinds of health checks](https://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm_configuration_guide_10_0_0/ltm_appendixa_monitor_types.html): Inband, which I'm currently using, and "check every n seconds", but neither of them solves the problem; see my previous comment. Do you know of any kind of health check that can avoid losing any client connections? – John29 Jun 12 '17 at 21:02

4 Answers


You could add a firewall rule on the server that blocks new connections but keeps existing ones active. I guess the server is Linux-based? If so, you could try:

iptables -A INPUT -p tcp --syn --destination-port <port> -j REJECT --reject-with icmp-host-prohibited

After that, you can use netstat to check whether any established connections remain, and bring the application down once there are none:

netstat -ant|grep <port>|grep EST

After you finish with the maintenance, you can remove the firewall rule. First, list all the rules to find it:

iptables -L INPUT -n --line-numbers

And remove it:

iptables -D INPUT <rule number>
  • Thanks for the answer, but unfortunately it won't work in my case because I don't have root on the servers. I can of course ask for the iptables rules to be changed once, but not every time I'm updating the application. – John29 Jun 12 '17 at 23:58

At any point when ServerSocket.accept() blocks, or ServerSocketChannel.accept() returns null, the backlog queue is empty. At that point, stop accepting and close the listening socket. Wait for all existing accepted sockets to finish their work and let the application exit at that point.
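A minimal sketch of this idea with a non-blocking `ServerSocketChannel`, assuming the application can switch from `ServerSocket` to NIO for the listener. The class and method names are hypothetical. It still has the race discussed in the comments: a handshake that completes between the last `accept()` and `close()` is lost.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

public class DrainOnShutdown {
    /**
     * Accepts every connection already waiting in the backlog queue, stopping at the first
     * accept() that returns null, then closes the listener.
     */
    static List<SocketChannel> drainAndClose(ServerSocketChannel listener) throws IOException {
        listener.configureBlocking(false);
        List<SocketChannel> drained = new ArrayList<>();
        SocketChannel ch;
        // In non-blocking mode, accept() returns null when no connection is pending
        while ((ch = listener.accept()) != null) {
            // Accepted channels are in blocking mode; ch.socket() gives a java.net.Socket
            drained.add(ch);
        }
        listener.close();
        return drained;
    }

    /** Loopback demo: three clients connect, nobody calls accept(), then the backlog is drained. */
    static int demo() throws IOException, InterruptedException {
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        InetSocketAddress addr = (InetSocketAddress) listener.getLocalAddress();
        List<SocketChannel> clients = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            clients.add(SocketChannel.open(addr)); // returns once the kernel completed the handshake
        }
        Thread.sleep(200); // give the kernel a moment to queue the final ACKs
        List<SocketChannel> drained = drainAndClose(listener);
        for (SocketChannel c : drained) c.close(); // real code would hand these to worker threads
        for (SocketChannel c : clients) c.close();
        return drained.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("drained " + demo() + " queued connection(s)");
    }
}
```

In a real server, the drained channels would be handed to the existing processing code, and the application would exit once those workers finish.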

user207421
  • How can I reliably determine when `ServerSocket.accept()` blocks? Sometimes I can get up to around 50 new connections in one second, so even if I can determine when it blocks, between that moment and the moment I call `ServerSocket.close()` a new connection can be established. It's a race condition. – John29 Jun 13 '17 at 23:21
  • Just add `volatile boolean blocking;` to your class, set it to `true` when you call `accept()`, and `false` as soon as it returns. Imprecise of course. The race condition is unavoidable too. Can you use a `ServerSocketChannel`? If so, when `Selector.select()` times out with only the `ServerSocketChannel` registered (for OP_ACCEPT), the backlog queue is not only empty but has been empty for the timeout period. You can just convert the resulting accepted `SocketChannels` to `Sockets` as before, pretty much. – user207421 Jun 14 '17 at 00:28
  • But `ServerSocketChannel` wouldn't avoid the race condition, right? It would just make it less likely for a new connection to be established before I close the socket because there were no new connections for some time. – John29 Jun 14 '17 at 00:50
  • You can't avoid the race whatever you do, given that the backlog queue exists and you have no control over it. – user207421 Jun 15 '17 at 00:48
  • That's what I thought. It looks like the only 100% way to prevent data loss is to make the clients wait and retry in case of a failed connection. I was hoping that a more elegant solution exists since what I'm trying to achieve should be a pretty common task when high availability is required. Thanks for the answer anyway. – John29 Jun 15 '17 at 02:14
  • You would need to first switch this server off at the load balancer, however you do that, then let it exhaust its own backlog queues and finish with all current connections. – user207421 Jun 15 '17 at 06:41
  • Unfortunately it's not an option as I don't have access to the production F5 load balancer. That's why I mentioned that I need some automated way in my question. – John29 Jun 15 '17 at 14:43
  • It would be interesting, on Linux (where 'tis claimed you can dynamically change the accept/SYN queue by calling listen() a second time), to test if setting the accept queue to 0 lets you finish accepting connections in the queue without accepting any new ones. https://stackoverflow.com/questions/43078649/listen-called-on-socket-more-than-once-expected-behavior – Ron Burk Oct 04 '20 at 03:11
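The `Selector` variant suggested in the comments can be sketched as follows: keep accepting until `select(timeout)` returns 0, meaning no connection became acceptable for the whole quiet period, then close the listener. This is a sketch under assumptions, not a tested production recipe. The names and the 300 ms quiet period are illustrative, and the race window is narrowed, not removed.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

public class QuietPeriodShutdown {
    /** Accepts until no connection has arrived for quietMillis, then closes the listener. */
    static List<SocketChannel> acceptUntilQuiet(ServerSocketChannel listener, long quietMillis)
            throws IOException {
        List<SocketChannel> accepted = new ArrayList<>();
        listener.configureBlocking(false); // required before registering with a Selector
        try (Selector selector = Selector.open()) {
            listener.register(selector, SelectionKey.OP_ACCEPT);
            // select() also returns 0 on wakeup() or thread interrupt; treated as "quiet" here
            while (selector.select(quietMillis) > 0) {
                selector.selectedKeys().clear();
                SocketChannel ch;
                while ((ch = listener.accept()) != null) {
                    accepted.add(ch);
                }
            }
        }
        listener.close();
        return accepted;
    }

    /** Loopback demo: two clients connect before shutdown starts; both should be accepted. */
    static int demo() throws IOException {
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        List<SocketChannel> clients = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            clients.add(SocketChannel.open(listener.getLocalAddress()));
        }
        List<SocketChannel> accepted = acceptUntilQuiet(listener, 300);
        for (SocketChannel c : accepted) c.close();
        for (SocketChannel c : clients) c.close();
        return accepted.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("accepted " + demo() + " connection(s) before closing");
    }
}
```

As noted above, the accepted `SocketChannel`s can be used through `socket()` much like the `Socket`s returned by `ServerSocket.accept()`.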

The easiest way to solve your problem is to put an additional load balancer locally, right in front of your application server.

Take a look at nginx and HAProxy and choose whichever suits your task better. Both support graceful shutdown, which means they stop accepting new connections but continue serving existing ones until they finish. Another advantage is that your application doesn't require any code changes.

Graceful shutdown for nginx:

nginx -s quit

Graceful shutdown for HAproxy:

kill -USR1 $(cat /var/run/haproxy.pid)
berserkk
  • I don't see how it would help for the connections already in the backlog; see the link that Sami Korhonen provided: http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. It would still drop the connections that are already established but for which nginx or HAProxy hasn't called `accept` yet. – John29 Jun 19 '17 at 16:38
  • So there's no way of fixing your issue. You can't write better connection processing in Java than what `nginx` and `HAproxy` already have. – berserkk Jun 19 '17 at 16:50

I came to the conclusion that what I'm trying to achieve is not possible on Linux. The problem is that the OS completes the initial handshake with the client (replying to the SYN with a SYN,ACK and receiving the final ACK) without the application having any control over this process. After the handshake, the connection becomes established and the OS puts it in the backlog queue. Once the connection is established, the load balancer I'm using (F5 BIG-IP) doesn't forward it to another server under any circumstances, regardless of what kind of health checks I have there. When I close the socket, connections that are already established but not yet accepted from the backlog queue get dropped.

However, it's possible to achieve this on Windows using the SO_CONDITIONAL_ACCEPT socket option and the WSAAccept function of the Windows Sockets (Winsock) API. This option allows the application to control the initial handshake. A good explanation can be found in this answer:

When calling listen() on a port, the OS starts accepting connections on that port. This means that it starts replying with SYN,ACK packets to connections, regardless of whether the C code has called accept() yet. ... However, on Windows, the SO_CONDITIONAL_ACCEPT call allows the application to take control of the backlog queue. This means that the server will not answer anything to a SYN packet until the application does something with the connection. This means that rejecting connections at this level can actually send RST packets to the network without creating state.

It looks like Linux doesn't have a similar feature, as described in this answer:

The three-way handshake is a part of the basic structure of TCP/IP, so it's embedded in the stack (i.e. at kernel level). All the non-kernel code you get your hands on operates AFTER the handshake.

John29