
Given the use cases Node is designed for, I figure it would have much less startup overhead than Python. But can anyone confirm this by way of experience or benchmarks?

The reason I ask is that I'm working on a project in which we start many (~20) Python scripts every minute. In these processes we do a ton of I/O with databases and the network. This is the kind of workload where Node would shine, so I want to explore the benefits of possibly using Node over Python here. I know I could use Twisted to do my I/O asynchronously, but I would still need to start these processes up every minute.

Edit:

I know that it's not always seen as ideal to start processes the way described here, but within the architecture of this system it is the right solution. This is why I'm considering Node: because V8 is event-driven only, calling "node myscript.js" at a rate of 20/min shouldn't matter. Calling myscript.js IS the event. It's just not in the browser anymore. [edit] Totally Wrong

After driving to work and thinking about this, I guess my question should have been about how best to benchmark code. From there I could profile both designs.

Update:

brandizzi showed me my error. Doing 'node myscript.js' is NOT the event, and in the end the Node interpreter does take a little longer to start than Python. However, that amount is pretty much nothing.

sbartell
  • If startup cost dominates, you're likely doing something wrong. Are you sure these must be separate processes? – delnan Jul 27 '11 at 14:41
  • 2
    Starting scripts at the rate of 20/min (1200/hr) is a bad design. Use `multiprocessing`. Start 20 workers who fetch their tasks from a single queue. Each worker handles one request *without* starting yet another process. – S.Lott Jul 27 '11 at 14:46
  • @S.Lott - It's worth noting that having each worker be used multiple times means you have to worry more about the state each unit of work leaves behind. Another approach is to start up a single process at the "base" level (i.e., before running any task code) and then fork off a disposable copy for each task (see the fork sketch after these comments). That way, you get the faster "startup time" but don't need to worry about cleaning up after each task. This is particularly useful when the tasks are "user defined". – RHSeeger Jul 27 '11 at 15:18
  • @RHSeeger: "the state each unit of work leaves behind"? That's trivial design. That's why we have "functions" and "objects". Python's namespaces are one honking great idea. What kind of thing are you worried about? – S.Lott Jul 27 '11 at 15:22
  • When the framework (which schedules and runs the tasks) is written by different people than the tasks themselves, it can become complicated. I don't have an example of such problems to share right now, but consider it this way... would you want to use an interpreter in which 20 other people you don't know have been defining new classes, objects, data, etc. for the past day? A clean interpreter has a lower chance of running into issues. Just speaking from past experience is all. – RHSeeger Jul 27 '11 at 15:37
  • 1
    @RHSeeger: "would you want to use an interpreter that had 20 other people you don't know defining new classes, objects, data, etc in it for the past day? " I do. I use Apache, Python, Django, MySQL and Linux. Lots of people I don't know have created a lot of software that I trust has been properly designed. – S.Lott Jul 27 '11 at 15:44
  • @S.Lott, not necessarily a bad design. Whether it's a good or bad design depends entirely on the end goal and overall architecture of the system. S.Lott, you do have a good point in general, but discussing whether or not this is good design in the context of my question is irrelevant. – sbartell Jul 27 '11 at 16:58
  • @delnan, yes they must be separate processes. This is why I am aiming to reduce startup cost. – sbartell Jul 27 '11 at 17:13
  • 2
    @S.Lott: Even Apache has a pre-fork MPM for compatibility with software using libraries that are not thread safe. Some software requires isolation. Some badly designed software needs to be used. Brushing that software off just because it is "badly designed" is not always a healthy business decision. – André Caron Jul 27 '11 at 17:23
  • "yes they must be separate processes". A bad design if you want to reduce process startup cost. You might want to provide some justification for this, since it is a bad idea to start lots of processes and then complain that things are slow because you're starting lots of processes. Starting a process is an avoidable cost. – S.Lott Jul 27 '11 at 19:39
  • @S.Lott I think it's a bad design if it degrades or debilitates the system. In my case it does not. We are only looking to tighten up the load for this architecture. So yes, you're right, bitching about how slow a system is when I start lots of processes is a bad idea. However, I'm not doing that. Right now the load isn't bad; we are just looking to manage it. If at some point our design just doesn't scale, then another avenue will have to be taken. – sbartell Jul 27 '11 at 20:44
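(For later readers: below is a minimal sketch of the worker-queue pattern S.Lott describes above, using Python's standard multiprocessing module. handle_task and the task payloads are hypothetical placeholders, not anything from the question.)

import multiprocessing

def handle_task(task):
    # Hypothetical: the database/network work one of the current scripts does.
    print("processed", task)

def worker(queue):
    # Each of the 20 long-lived workers pulls tasks from the shared queue.
    while True:
        task = queue.get()
        if task is None:          # sentinel value: shut this worker down
            break
        handle_task(task)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for _ in range(20)]
    for w in workers:
        w.start()
    for task in range(100):       # enqueue work as it arrives
        queue.put(task)
    for _ in workers:             # one sentinel per worker
        queue.put(None)
    for w in workers:
        w.join()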
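(And a rough, Unix-only sketch of RHSeeger's fork-per-task variant; again, handle_task is a made-up placeholder. The parent process pays the interpreter startup cost once, and each task runs in a disposable forked copy.)

import os

def handle_task(task):
    # Hypothetical per-task work; runs in a freshly forked child process.
    print("processed", task)

def run_in_fork(task):
    pid = os.fork()
    if pid == 0:                  # child: do the work, then exit immediately
        handle_task(task)
        os._exit(0)
    os.waitpid(pid, 0)            # parent: reap the disposable copy

if __name__ == "__main__":
    for task in range(20):        # one throwaway child per task
        run_in_fork(task)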

1 Answer


Your question is too vague, IMHO. Anyway, if you want to compare startup time, why not just use `time`? Here are some examples (the null.* files are empty, since we want to measure only the interpreters' boot time):

$ i=0
$ time while [ $((i++)) -lt 1000 ] ; do python null.py ; done

real    0m55.777s
user    0m30.154s
sys 0m13.910s

$ i=0
$ time while [ $((i++)) -lt 1000 ] ; do node null.js ; done

real    1m37.618s
user    0m59.578s
sys 0m18.038s

These preliminary results indicate that node is somewhat slower to boot up. (Your statement that "Calling myscript.js IS the event" does not look true to me, and my suspicion seems confirmed: the invocation of myscript.js is indeed an event, but running it with `node myscript.js` loads an entire process just to handle that one event.)
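(If you prefer a per-invocation average instead of a total over a shell loop, you could also drive the measurement from Python itself. This is just a sketch, assuming python and node are on your PATH and the same empty null.py/null.js files:)

import subprocess
import time

def average_startup(cmd, runs=50):
    # Launch the command `runs` times and return the mean wall-clock time.
    total = 0.0
    for _ in range(runs):
        start = time.time()
        subprocess.call(cmd)
        total += time.time() - start
    return total / runs

print("python:", average_startup(["python", "null.py"]))
print("node:  ", average_startup(["node", "null.js"]))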

To be honest, though, the vagueness of your question makes me wonder whether you are not attempting some kind of premature optimization; in other words, it looks like you do not even have a problem to solve yet. I may be wrong, of course, but maybe you do not need to worry about this for now (especially since I started one thousand processes of each interpreter in roughly one minute; if you will start just twenty, the interpreter startup may be no problem at all).

Anyway, my Pythonista side would make some suggestions. For example: how much time do the processes take to finish? If it is a short time, you really could think about using the Pool class from the multiprocessing module, which would create a pool of worker processes to handle your workload (a sketch follows below). Even if you want to invoke your scripts through some kind of shell script (there is not much difference between invoking them from a Bash script or from the multiprocessing module anyway), I bet Python has the advantage of caching compiled bytecode in .pyc files. Does V8 do that too?
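(A rough sketch of what the Pool approach could look like, assuming each minute's batch of scripts can be expressed as a function call; run_job and its body are made up for illustration:)

import multiprocessing

def run_job(job):
    # Hypothetical stand-in for the database/network work one script does today.
    return job * job

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=20)    # one worker per current script
    results = pool.map(run_job, range(20))       # blocks until every job finishes
    pool.close()
    pool.join()
    print(results)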

brandizzi
  • BTW, if you are using Windows, you may want to look at [what I think is the equivalent of `time`](http://stackoverflow.com/q/673523/287976). Sorry for my gaffe; I am so used to managing Unix boxes that I frequently and ungraciously forget to recommend tools for other platforms :-/ – brandizzi Jul 27 '11 at 18:13
  • You bring up a good point that I had overlooked. I would only gain the speed benefit of running node if the interpreter were already running. These scripts run for 1 minute each and do quite a bit of I/O during that time, so there is substantial strain on the CPU at runtime. About Pool, I really want to avoid executing my scripts from one place. My goal is to have them start and die completely separately from each other. BTW, I too work solely on Unix :) – sbartell Jul 27 '11 at 19:51