
I've read this and this. However, my case is different. I don't need a multiplex service on the server, nor do I need multiple connections to the server.

Background:
For my Big Data project, I need to compute the coreset of a given big data set. A coreset is a subset of the data that preserves its most important mathematical relationships.

Workflow:

  • Slice the huge data into smaller chunks
  • The client parses a chunk and sends it to the server
  • The server computes the coreset and saves the results

My Problem:
The whole thing runs as a single thread of execution. The client parses one chunk, then waits for the server to finish computing the coreset, then parses another chunk, and so on.

Goal:
Exploit multiprocessing. The client parses multiple chunks at the same time, and for each compute-coreset request the server tasks a thread to handle it, where the number of threads is limited, something like a pool.

I understand I need to use a different server class than TSimpleServer and move towards TThreadPoolServer or perhaps TThreadedServer. I just can't make up my mind on which one to choose, as neither seems to suit me:

TThreadedServer spawns a new thread for each client connection, and each thread remains alive until the client connection is closed.


In TThreadPoolServer each client connection gets its own dedicated server thread. The server thread goes back to the thread pool for reuse after the client closes the connection.
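For reference, a minimal sketch of how the pool size of a TThreadPoolServer can be capped (reusing the processor/transport/factory objects from the server code further down). Note, though, that these threads are handed out per connection, not per request on a single connection:

from thrift.server import TServer

server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)
server.setNumThreads(2)  # limit the pool, e.g. to two worker threads
server.serve()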

I don't need a thread per connection; I want a single connection, and the server to handle multiple service requests at the same time. Visualization:

Client:
Thread1: parses(chunk1) --> Request compute coreset
Thread2: parses(chunk2) --> Request compute coreset
Thread3: parses(chunk3) --> Request compute coreset

Server: (Pool of 2 threads)
Thread1: Handle compute coreset
Thread2: Handle compute coreset
.
. 
Thread1 becomes available and handles another compute coreset

Code:
api.thrift:

struct CoresetPoint {
    1: i32 row,
    2: i32 dim,
}

struct CoresetAlgorithm {
    1: string path,
}

struct CoresetWeightedPoint {
    1: CoresetPoint point,
    2: double weight,
}

struct CoresetPoints {
    1: list<CoresetWeightedPoint> points,
}

service CoresetService {

    void initialize(1:CoresetAlgorithm algorithm, 2:i32 coresetSize)

    oneway void compressPoints(1:CoresetPoints message)

    CoresetPoints getTotalCoreset()
}


Server (implementation removed for brevity):

import logging

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer

# Generated by `thrift --gen py api.thrift`
from api import CoresetService


class CoresetHandler:
    def initialize(self, algorithm, coresetSize):
        pass  # implementation removed

    def _add(self, leveledSlice):
        pass  # implementation removed

    def compressPoints(self, message):
        pass  # implementation removed

    def getTotalCoreset(self):
        pass  # implementation removed


if __name__ == '__main__':
    logging.basicConfig()
    handler = CoresetHandler()
    processor = CoresetService.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)

    # Alternatively, a fixed-size pool of worker threads:
    # server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)

    print 'Starting the server...'
    server.serve()
    print 'done.'


Client:

import os

from thrift import Thrift
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Generated by `thrift --gen py api.thrift`
from api import CoresetService
from api.ttypes import CoresetAlgorithm

try:
    # Make socket
    transport = TSocket.TSocket('localhost', 9090)

    # Buffering is critical. Raw sockets are very slow
    transport = TTransport.TBufferedTransport(transport)

    # Wrap in a protocol
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    # Create a client to use the protocol encoder
    client = CoresetService.Client(protocol)

    # Connect!
    transport.open()


    # Here the data is sliced; then, in a loop over all files saved in
    # the directory I specified, each file is parsed and
    # client.compressPoints(data) is invoked.

    SliceFile(...)
    p = CoresetAlgorithm(...)
    client.initialize(p, 200)
    for filename in os.listdir('/home/tony/DanLab/slicedFiles'):
        if filename.endswith(".txt"):
            data = _parse(filename)
            client.compressPoints(data)
    compressedData = client.getTotalCoreset()


    # Close!
    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)

Question: Is this possible in Thrift, and which server should I use? I partially solved the problem of the client waiting for the server to finish computation by adding oneway to the function declaration, which indicates that the client only makes a request and does not wait for any response at all.


1 Answer


By its nature, this is more of an architectural problem than a Thrift problem. Given that the premises

I don't need a thread per connection; I want a single connection, and the server to handle multiple service requests at the same time.

and

I partially solved the problem of the client waiting for the server to finish computation by adding oneway to the function declaration, which indicates that the client only makes a request and does not wait for any response at all.

accurately describe the use case, you want this:

+---------------------+
| Client              |
+---------+-----------+
          |
          |
+---------v-----------+
| Server              |
+---------+-----------+
          |
          |
+---------v-----------+          +---------------------+
| Queue<WorkItems>    <----------+ Worker Thread Pool  |
+---------------------+          +---------------------+

The server's only task is to receive the requests and insert them into the work-item queue as quickly as possible. These work items are handled by an independent worker thread pool, which is otherwise entirely separate from the server part. The only shared piece is the work-item queue, which of course needs properly synchronized access methods.

Regarding the server choice: if the server can accept requests quickly enough, even a TSimpleServer may do.
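A minimal sketch of that layout in Python, reusing the CoresetHandler name from the question; compute_coreset and collect_results are hypothetical placeholders for the actual coreset computation and for merging the partial results:

import threading
import Queue  # 'queue' on Python 3

work_queue = Queue.Queue()  # the only shared part; Queue is thread-safe
NUM_WORKERS = 2

def worker():
    while True:
        message = work_queue.get()    # blocks until a work item is available
        try:
            compute_coreset(message)  # placeholder: the real coreset computation
        finally:
            work_queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

class CoresetHandler:
    def compressPoints(self, message):
        # oneway call: just enqueue and return immediately, so the
        # receive path stays fast even with a single-threaded server
        work_queue.put(message)

    def getTotalCoreset(self):
        work_queue.join()         # wait until all queued chunks are processed
        return collect_results()  # placeholder: merge the partial coresets

The handler never computes anything itself; it only feeds the queue, which is exactly what keeps even a TSimpleServer responsive.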

  • Ok, got it, your approach is indeed better than mine: the server queues the work, a worker then takes a chunk and does the computation. One more question, please: if I multitask the parsing in the client (it is time consuming, so I want to multitask it), I might end up calling `client.compressPoints(message)` at the same time from different threads in the same client. Is that prone to problems? – Tony Mar 03 '17 at 10:13
  • Usually clients are not thread safe, nor are the underlying physical connection handles. IOW, one client per thread. If the clients leave the connection open, then you will need an adequate number of server endpoints, hence `TSimpleServer` will no longer be suitable in such a scenario. – JensG Mar 03 '17 at 11:53
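A sketch of the "one client (and connection) per thread" advice from the comment above, reusing the Thrift modules and the _parse function from the question's client code; filenames stands in for the list of sliced files:

import threading

def parse_and_send(filename):
    # each thread gets its own connection and its own client object
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = CoresetService.Client(protocol)
    transport.open()
    try:
        client.compressPoints(_parse(filename))
    finally:
        transport.close()

threads = [threading.Thread(target=parse_and_send, args=(f,)) for f in filenames]
for t in threads:
    t.start()
for t in threads:
    t.join()

As noted in the comment above, with several connections open at the same time a single-threaded TSimpleServer will no longer be able to serve them all concurrently.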