I've read this and this. However, my case is different. I don't need a multiplexed service on the server, nor do I need multiple connections to the server.
Background:
For my Big Data project, I need to compute the coreset of a given big data.
Coreset is a subset of the big data that preserves the big data’s most important mathematical relationships.
Workflow:
- Slice huge data to smaller chunks
- Client parses chunk and sends it to server
- Server computes coreset and saves results
My Problem:
The whole thing runs as a single thread of execution.
The client parses one chunk, then waits for the server to finish computing the coreset, then parses another chunk, and so on.
Goal:
Exploit multiprocessing. The client parses multiple chunks at the same time, and for each compute-coreset
request the server assigns a thread to handle it, where the number of threads is limited. Something like a pool.
I understand I need to use a different server than TSimpleServer and move towards TThreadPoolServer or perhaps TThreadedServer. I just can't make up my mind which one to choose, as neither seems to suit me:
TThreadedServer spawns a new thread for each client connection, and each thread remains alive until the client connection is closed.
In TThreadPoolServer each client connection gets its own dedicated server thread from a fixed-size pool; the thread goes back to the pool for reuse once the client closes the connection.
I don't need a thread per connection; I want a single connection, and the server to handle multiple service requests at the same time. Visualization:
Client:
Thread1: parses(chunk1) --> Request compute coreset
Thread2: parses(chunk2) --> Request compute coreset
Thread3: parses(chunk3) --> Request compute coreset
Server: (Pool of 2 threads)
Thread1: handles compute coreset
Thread2: handles compute coreset
.
.
Thread1 becomes available and handles another compute coreset
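The pool behaviour in the visualization above can be mimicked in plain Python before wiring it into Thrift. A minimal sketch using multiprocessing.pool.ThreadPool; the chunk data and compute_coreset function are illustrative stand-ins, not the real parser or coreset code:

```python
from multiprocessing.pool import ThreadPool

def compute_coreset(chunk):
    # Stand-in for the real coreset computation: just sum the chunk.
    return sum(chunk)

# Three parsed chunks are ready at the same time...
chunks = [[1, 2], [3, 4], [5, 6]]

# ...but only 2 worker threads handle them; a worker picks up the
# third chunk as soon as it becomes available again.
pool = ThreadPool(processes=2)
results = pool.map(compute_coreset, chunks)
pool.close()
pool.join()
print(results)  # -> [3, 7, 11]
```

pool.map preserves input order, so the results line up with the chunks even though two requests are serviced concurrently.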
Code:
api.thrift:
struct CoresetPoint {
  1: i32 row,
  2: i32 dim,
}
struct CoresetAlgorithm {
  1: string path,
}
struct CoresetWeightedPoint {
  1: CoresetPoint point,
  2: double weight,
}
struct CoresetPoints {
  1: list<CoresetWeightedPoint> points,
}
service CoresetService {
  void initialize(1:CoresetAlgorithm algorithm, 2:i32 coresetSize)
  oneway void compressPoints(1:CoresetPoints message)
  CoresetPoints getTotalCoreset()
}
Server: (implementation bodies removed for brevity)
class CoresetHandler:
    def initialize(self, algorithm, coresetSize):
        pass
    def _add(self, leveledSlice):
        pass
    def compressPoints(self, message):
        pass
    def getTotalCoreset(self):
        pass

if __name__ == '__main__':
    logging.basicConfig()
    handler = CoresetHandler()
    processor = CoresetService.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()
    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    # Alternatively, for a pool-based multithreaded server:
    # server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)
    print 'Starting the server...'
    server.serve()
    print 'done.'
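For what it's worth, the Python Thrift library's TThreadPoolServer exposes setNumThreads to cap the worker pool, which matches the "pool of 2 threads" picture. A sketch, reusing the processor/transport/factories from the wiring above (it depends on the generated CoresetService module, so it is not runnable on its own):

```python
# Same wiring as above, but with a bounded server-side pool.
server = TServer.TThreadPoolServer(processor, transport, tfactory, pfactory)
server.setNumThreads(2)  # pool of 2 worker threads, as in the visualization
server.serve()
```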
Client:
try:
    # Make socket
    transport = TSocket.TSocket('localhost', 9090)
    # Buffering is critical. Raw sockets are very slow
    transport = TTransport.TBufferedTransport(transport)
    # Wrap in a protocol
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    # Create a client to use the protocol encoder
    client = CoresetService.Client(protocol)
    # Connect!
    transport.open()
    # Here the data is sliced into files saved in the directory I
    # specified; then, in a loop, each file is parsed and
    # client.compressPoints(data) is invoked.
    SliceFile(...)
    p = CoresetAlgorithm(...)
    client.initialize(p, 200)
    for filename in os.listdir('/home/tony/DanLab/slicedFiles'):
        if filename.endswith(".txt"):
            data = _parse(filename)
            client.compressPoints(data)
    compressedData = client.getTotalCoreset()
    # Close!
    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)
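One caveat worth stating: a single Thrift client object is not safe for concurrent calls, because every call writes to and reads from the same transport. So if the parsing loop above is parallelised, the usual pattern is one connection per client thread. A hedged sketch of that pattern (the directory path and _parse helper are the placeholders from the code above; this is not runnable without the generated stubs):

```python
import threading

def worker(filename):
    # Each thread opens its own connection; a single TBufferedTransport
    # must not be shared between threads.
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = CoresetService.Client(protocol)
    transport.open()
    client.compressPoints(_parse(filename))
    transport.close()

threads = [threading.Thread(target=worker, args=(f,))
           for f in os.listdir('/home/tony/DanLab/slicedFiles')
           if f.endswith(".txt")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this layout the server-side choice between TThreadedServer and TThreadPoolServer becomes the knob that limits concurrency, since each client thread is a separate connection.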
Question:
Is this possible in Thrift? Which server type should I use?
I solved the partial problem of the client waiting for the server to finish its computation by adding oneway
to the function declaration, which indicates that the client only makes a request and does not wait for any response at all.