
I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished, I want to simplify my work by extending it with a Python simulation API and some kind of job-queue system, which should do the following:

1. specify a set of parameters on which simulations should be performed and put them into a queue on a host computer

2. have workers perform the simulations on the remote machines

3. return the results to the host computer

I had a look at different frameworks for accomplishing this task, and my first choice came down to IPython.parallel. I had a look at the documentation and, from what I tested, it seems pretty easy to use. My approach would be to use a load-balanced view as explained at

http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
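For reference, a minimal sketch of how I imagine using it (the ./simulate binary and the parameter/result file names are just placeholders for my C code, and I assume an ipcontroller plus ipengines are already running on the remote machines):

    from IPython.parallel import Client

    rc = Client()                      # connect to the running ipcontroller
    lview = rc.load_balanced_view()    # scheduler assigns tasks to idle engines

    def run_simulation(params):
        # placeholder wrapper around my compiled C code: run one parameter
        # set and hand back the name of the result file it wrote
        import subprocess
        subprocess.check_call(["./simulate", params["config"]])
        return params["result_file"]

    parameter_sets = [{"config": "run_%d.cfg" % i,
                       "result_file": "run_%d.out" % i}
                      for i in range(10)]

    async_results = lview.map_async(run_simulation, parameter_sets)
    async_results.wait_interactive()   # block until all tasks have finished
    print(async_results.get())         # result-file names, in submission order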

But what I don't see is:

  • what happens if, e.g., the ipcontroller crashes? Is my job queue gone?
  • what happens if a remote machine crashes? Is there some kind of error handling?

Since I run relatively long simulations (1-2 weeks), I don't want my simulations to fail if some part of the system crashes. So is there a way to handle this in IPython.parallel?

My second approach would be to use pyzmq and implement the job system from scratch. In this case, what would be the best ZeroMQ pattern for this situation?
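From the ZeroMQ guide, the ventilator / worker / sink pipeline (PUSH/PULL sockets) looks like the closest fit I have found so far. A rough sketch of what a worker could look like under that pattern (hostnames, ports and the ./simulate call are placeholders):

    import subprocess
    import zmq

    context = zmq.Context()

    # pull parameter sets from the host's ventilator socket
    receiver = context.socket(zmq.PULL)
    receiver.connect("tcp://host.example.org:5557")

    # push results back to the host's sink socket
    sender = context.socket(zmq.PUSH)
    sender.connect("tcp://host.example.org:5558")

    while True:
        params = receiver.recv_json()                            # one parameter set per message
        subprocess.check_call(["./simulate", params["config"]])  # placeholder C binary
        with open(params["result_file"], "rb") as f:
            sender.send(f.read())                                # ship the raw result to the sink

Would something like this be reliable enough, or is a request/reply pattern with explicit acknowledgements the better choice here?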

And last but not least, is there maybe a better framework for this scenario?

jrsm

2 Answers


What lies behind the curtain is a somewhat more complex view of how to arrange the work-package flow alongside the (parallelised) number-crunching pipeline(s).

Whether the work-package amounts to many CPU-core-weeks, or the lump-sum volume of the job goes above a few hundreds of thousands of CPU-core-hours, the principles are similar and follow common sense.

Key Features

  • scalability of the computing performance of all resources involved ( ideally a linear one )
  • ease of the task-submission role
  • fault-resilience of submitted task(s) ( ideally with automated self-healing )
  • feasible TCO of access to / use of a sufficient pool of resources ( upfront co$ts, recurring co$ts, adaptation co$ts, co$ts of $peed )

Approaches to a Solution

  • a home-brew architecture for a distributed, massively parallel, scheduler-based, self-healing computation engine

  • re-use of available grid-based computing resources

Based on my own experience with repetitive runs of a numerically intensive optimisation problem over a vast parameterSetVectorSPACE ( which could not be decomposed into any trivialised GPU parallelisation scheme ), the second approach has proved more productive than an attempt to burn dozens of man-years in just another trial to re-invent the wheel.

In an academic environment, one may get much easier access to an acceptable pool of resources for processing the work-packages, while commercial entities may acquire the same based on their acceptable budgeting thresholds.



user3666197

My gut instinct is to suggest rolling your own solution for this, because, as you said, otherwise you're depending on IPython not crashing.

I would run a simple Python service on each node which listens for run commands. When it receives one, it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so that when it runs it completely disconnects itself from Python. That way, if your node's Python instance crashes, you can still get the data as long as the C program executes successfully. Have the C program write the output data to a file or a database, and when the task is finished write "finished" to a "status" file or something similar. The Python service should monitor that file, and when it indicates "finished" it should retrieve the data and send it back to the server.
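A rough sketch of what such a node service could look like (the binary name, file names and host address are placeholders, not something your code has to match):

    import os
    import socket
    import subprocess
    import time

    STATUS_FILE = "status"
    RESULT_FILE = "results.dat"
    HOST_ADDR = ("host.example.org", 9999)   # where results get sent back

    def launch_simulation(config_path):
        # Start the C program in its own session, detached from this Python
        # process, so the simulation survives even if this service dies.
        log = open("simulate.log", "w")
        subprocess.Popen(["./simulate", config_path],
                         stdout=log, stderr=subprocess.STDOUT,
                         preexec_fn=os.setsid)

    def wait_and_send_results():
        # Poll the status file the C program writes when it is done, then
        # ship the result file back to the host.
        while True:
            if os.path.exists(STATUS_FILE):
                with open(STATUS_FILE) as f:
                    if f.read().strip() == "finished":
                        break
            time.sleep(60)
        sock = socket.create_connection(HOST_ADDR)
        with open(RESULT_FILE, "rb") as f:
            sock.sendall(f.read())
        sock.close()

The exact transport back to the server (raw socket, HTTP, scp, a database insert) matters less than the fact that the status and result files outlive any crash of the Python layer.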

The central idea of this design is to have as few points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As far as handling system crashes, network disconnects, etc. goes, that's up to you.

anderspitman