7

I have a Thrift API served from a Java application running on Linux. I'm using a .NET client to connect to the API and execute operations.

The first few calls to the service work fine without errors, but then (seemingly at random) a call will "hang." If I force-quit my client and try to reconnect, the service either hangs again, or my client has the following error:

Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at Thrift.Transport.TStreamTransport.Read(Byte[] buf, Int32 off, Int32 len) 
   (etc.)

When I use JConsole to get a thread dump, the server thread is blocked in accept():

"Thread-1" prio=10 tid=0x00002aaad457a800 nid=0x79c7 runnable [0x00000000434af000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
    - locked <0x00000005c0fef470> (a java.net.SocksSocketImpl)
    at java.net.ServerSocket.implAccept(ServerSocket.java:462)
    at java.net.ServerSocket.accept(ServerSocket.java:430)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:113)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
    at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
    at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:63)

netstat on the server shows connections to the service port in TIME_WAIT, which eventually disappear several minutes after I force-quit the client (as would be expected).

The code that sets up the Thrift service is as follows:

        int port = thriftServicePort;
        String host = thriftServiceHost;
        InetAddress adr = InetAddress.getByName(host);
        InetSocketAddress address = new InetSocketAddress(adr, port);
        TServerTransport serverTransport = new TServerSocket(address);
        TServer server = new TSimpleServer(new TServer.Args(serverTransport).processor((org.apache.thrift.TProcessor)processor));

        server.serve();

Note that we're using the TServerSocket constructor that takes an explicit hostname or IP address. I suspect that I should switch to the constructor that takes only a port (ultimately binding to InetAddress.anyLocalAddress()). Alternatively, I suppose I could configure the service to bind to the "wildcard" address ("0.0.0.0").

I should mention that the service is not hosted on the open Internet. It is hosted in a private network and I am using SSH tunneling to reach it. Hence, the hostname that the service is bound to does not resolve in my local network (although I can make the initial connection via tunneling). I wonder if this is something similar to the RMI TCP callback problem?

Is there a technical explanation for what's going on (if this is a common issue), or are there additional troubleshooting steps that I can take?

UPDATE

Had the same problem today, but this time jstack shows that the Thrift server is blocking forever reading from the input stream:

"Thread-1" prio=10 tid=0x00002aaad43fc000 nid=0x60b3 runnable [0x0000000041741000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
        at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:70)

So we need to set a "client timeout" in the TServerSocket constructor. But why would that cause the application to also refuse connections when blocking on accept()?
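The blocking read in the second thread dump can be reproduced without Thrift. The sketch below uses only JDK sockets and assumes that the Thrift "client timeout" ends up as a socket read timeout (SO_TIMEOUT) on each accepted connection; that mapping is an assumption here, not something confirmed above. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutSketch {
    /**
     * Accepts one connection from a client that never sends anything and
     * reports whether the blocking read was cut short by SO_TIMEOUT.
     */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket silentClient = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // A timeout of 0 (the JDK default) means read() may block forever.
            accepted.setSoTimeout(timeoutMillis);
            try {
                // Same kind of blocking call as socketRead0 in the thread dump.
                accepted.getInputStream().read();
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true; // the read gave up instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a non-zero timeout the server thread gets control back after a silent client stalls, which is the behavior a client timeout on the server socket is meant to provide.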

noahlz
  • Possibly a related (or the same) issue? http://stackoverflow.com/questions/6059071/apache-thrift-client-timeout-issues – noahlz Jan 25 '13 at 22:04

4 Answers

4

From your stack trace it seems you are using TSimpleServer, whose javadocs say,

Simple singlethreaded server for testing.

Probably what you want to use is TThreadPoolServer.

Most likely the single thread of TSimpleServer is blocked waiting for the dead client to respond or time out. Because TSimpleServer is single-threaded, no thread is available to process other requests.
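The effect described above can be shown with plain JDK sockets. This is a minimal sketch, not TSimpleServer itself: a hypothetical one-connection-at-a-time echo loop stands in for the single-threaded server. The second client's TCP connection succeeds (the operating system queues it in the listen backlog), but its request gets no reply while the first, idle client holds the only thread, so from the client's side the call simply hangs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SingleThreadSketch {
    /** Serves one connection at a time, echoing bytes until that client disconnects. */
    static void serveOneAtATime(ServerSocket server) {
        while (!server.isClosed()) {
            try (Socket client = server.accept()) {
                int b;
                while ((b = client.getInputStream().read()) != -1) {
                    client.getOutputStream().write(b);
                }
            } catch (IOException e) {
                // Listener closed or peer reset; the loop condition decides whether to continue.
            }
        }
    }

    /** Returns true if a second client gets no reply while the first client sits idle. */
    static boolean secondClientStarves(int waitMillis) {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> serveOneAtATime(server));
            serverThread.setDaemon(true);
            serverThread.start();

            // The idle client connects first and never sends anything.
            try (Socket idle = new Socket(server.getInetAddress(), server.getLocalPort());
                 Socket second = new Socket(server.getInetAddress(), server.getLocalPort())) {
                second.getOutputStream().write(42);
                second.getOutputStream().flush();
                second.setSoTimeout(waitMillis);
                try {
                    second.getInputStream().read();
                    return false; // the server answered
                } catch (SocketTimeoutException e) {
                    return true; // connected, but never served
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second client starved: " + secondClientStarves(300));
    }
}
```

Handing each accepted socket to a worker thread, which is what a thread-pool server does, removes this starvation, although a stalled client would still tie up its own worker.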

sbridges
  • Ah, but why does it continue to hang/refuse connections even after the client is killed and the stacktrace is on `accept()`? – noahlz Jan 28 '13 at 14:18
  • Are you sure you were blocked on accept when you couldn't connect? If there are no connections, it is normal to block on accept; you may have acquired the stack trace after the Socket.read call timed out. The second stack trace is consistent with a single-threaded server waiting for a client. – sbridges Jan 28 '13 at 16:30
  • Yes, I am sure. The application gets in two different states: hanging on `read` and rejecting all requests, while the stacktrace has `accept`. – noahlz Jan 28 '13 at 16:42
  • I fear that I will not be able to test your proposed solution before the bounty expires. I was hoping for more detailed technical analysis of the information I've provided. I'll likely accept your answer if no one else provides something better, however. – noahlz Jan 29 '13 at 17:09
3

I have some suggestions. You mentioned that the first few calls to the server work and then there are hangs. That's a clue. One scenario where this happens is when the client does not send all of its bytes to the server. I am not familiar with TSimpleServer, but I assume it listens on a port, uses some binary protocol, and expects any client to talk to it in that protocol. Your .NET client talks to this server by sending bytes. If it's not flushing its output buffer correctly, it may not be sending all the bytes to the server, which would leave the server hanging.

In Java, this could happen on the client side like this:

BufferedOutputStream stream = new BufferedOutputStream(socket.getOutputStream()); // get the socket stream to write to
stream.write(content); // write everything that needs to be written
stream.flush(); // if flush() is not called, the server may get an incomplete message and hang waiting for the rest

Suggestions:

a) Go through your .NET client code. Check that every part of the code that communicates with the server properly calls the equivalent of flush() or the cleanup methods. Note: I saw from the Thrift documentation that the transport layer defines a flush(). You should scan your .NET code and see whether it uses the transport methods. http://thrift.apache.org/docs/concepts/

b) For further debugging, you could try writing a small Java client that simulates your .NET client. Run the Java client on your Linux machine (the same machine where TSimpleServer runs) and see if it causes the same issue. If it does, you can debug your Java client and find the root cause. If it doesn't, you can then run it where your .NET client runs, see if there are any issues, and take it from there.

Edit: c) I found sample Thrift client code in Java here: https://chamibuddhika.wordpress.com/2011/10/02/apache-thrift-quickstart-tutorial/ I noticed that it calls transport.open(), does its work, and then calls transport.close(). As suggested in a), you could go through your .NET client code and check whether you are calling the transport methods flush() and close() on completion.

Zenil
  • The Thrift client code is almost entirely in generated code and the libthrift library. So the defect would have to be in core Thrift. That's a tough case to make. – noahlz Jan 31 '13 at 17:36
  • The defect could also be in the way your "transport" object is set up. (I am referring to the .NET equivalent of the transport object in the Java example.) For debugging purposes you could do option (b), basically generate a Java Thrift client. If it works, then comparing the two clients will give you a clue. If it doesn't, you now have two clients with the same problem, which could mean a defect in the Thrift code or some kind of network issue. – Zenil Jan 31 '13 at 18:48
  • We're using core Thrift .NET library code: `TTransport transport = new TSocket(HOST, PORT); TProtocol protocol = new TBinaryProtocol(transport); transport.Open();` You're basically saying that there is a core defect in Thrift. – noahlz Feb 01 '13 at 15:01
  • @noahz Expecting generated code to work as-is is reasonable in most cases. But in cases like this one, where the code interacts with sockets and the network and may be running into network latency issues, you have to tweak it. So I am basically saying that you could try experimenting with your transport object (for example, calling flush() just before close()). – Zenil Feb 01 '13 at 15:16
  • All the API calls generated by the Thrift compiler end with `TTransport.Flush()` – noahlz Feb 01 '13 at 15:30
  • OK. I can't think of anything else for now. If you have no other tips, you could try generating a Java client and see if that works. – Zenil Feb 01 '13 at 15:44
  • I modified the .NET client to add an extra `Flush()` - didn't resolve the issue. – noahlz Feb 01 '13 at 19:44
0

Binding the Thrift service to the wildcard address ("0.0.0.0") solved the problem; no more hanging.
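For reference, the difference between the two kinds of bind can be seen with a plain JDK ServerSocket. This is only a sketch of what "wildcard" versus "specific address" means at the socket level; it does not explain why the specific bind led to hangs behind the SSH tunnel. The class and method names are hypothetical, and port 0 is used so the operating system picks a free port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

class BindSketch {
    /** Binds a listener to the given address and reports whether it listens on all interfaces. */
    static boolean bindsToWildcard(InetSocketAddress address) {
        try (ServerSocket server = new ServerSocket()) {
            server.bind(address);
            return server.getInetAddress().isAnyLocalAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Port-only address: the JDK fills in the wildcard address.
        InetSocketAddress wildcard = new InetSocketAddress(0);
        // Explicit address: the listener is reachable only through that interface.
        InetSocketAddress loopbackOnly = new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);

        System.out.println("port-only bind is wildcard: " + bindsToWildcard(wildcard));
        System.out.println("explicit bind is wildcard: " + bindsToWildcard(loopbackOnly));
    }
}
```

In the setup code from the question, the equivalent change would be to pass an address built from the port alone (or from "0.0.0.0") instead of the resolved hostname.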

Using the multithreaded server would make the application more responsive, but would still result in hung / incomplete requests.

If someone stumbles across this question and can provide a more complete explanation and how it relates to the Java RMI TCP callbacks issue (which I linked to in my question), upvotes for you.

noahlz
0

I have a similar C++ server/client environment.

The C++ client calls a method (attributeDefinitionsAliases) and waits for a response.

The C++ server starts writing to the socket but then blocks. (Wireshark capture: image not available.)

After the C++ client is closed, an exception appears on the C++ server:

Thrift internal message: TSocket::write_partial() send() : errno = 10054

Thrift internal message: TConnectedClient died: write() send(): errno = 10054

EDIT 1: It is not a Thrift problem. The problem seems to be in the way the server is launched. I have an application (launcher-app) that starts the server with QProcess (https://doc.qt.io/archives/qt-4.8/qprocess.html); launching it with popen works fine.