25

I want the following:

  • During startup, the master process loads a large table from a file and saves it into a shared variable. The table has 9 columns and 12 million rows (432 MB).
  • The worker processes run an HTTP server, accepting real-time queries against the large table.

Here is my code, which obviously does not achieve my goal.

var my_shared_var;
var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Load a large table from file and save it into my_shared_var,
  // hoping the worker processes can access this shared variable,
  // so that the worker processes do not need to reload the table from file.
  // The loading typically takes 15 seconds.
  my_shared_var = load('path_to_my_large_table');

  // Fork worker processes
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  // The following line of code actually outputs "undefined".
  // It seems each process has its own copy of my_shared_var.
  console.log(my_shared_var);

  // Then perform query against my_shared_var.
  // The query should be performed by the worker processes,
  // otherwise the master process will become a bottleneck.
  var result = query(my_shared_var);
}

I have tried saving the large table into MongoDB so that each process can easily access the data. But the table is so large that MongoDB takes about 10 seconds to complete my query, even with an index. This is too slow and unacceptable for my real-time application. I have also tried Redis, which holds data in memory, but Redis is a key-value store and my data is a table. I also wrote a C++ program to load the data into memory, and the query took less than 1 second, so I want to emulate this in node.js.

Jacky Lee
  • 1,093
  • 3
  • 13
  • 21
  • Is `memcached` a suitable choice for this data? – sarnold Jun 09 '12 at 23:04
  • If your set grows, you might want to reconsider optimizing the data structure or the query for database software. Furthermore, Node.js would be a terrible language choice for a database system, while your C++ program could be good enough. – Shane Hsu Mar 02 '17 at 00:39

5 Answers

15

To put your question in a few words: you need to share data from the MASTER process with the WORKER processes. It can be done very easily using messaging events:

From Master to worker:

worker.send({ /* json data */ });    // In Master part

process.on('message', yourCallbackFunc);    // In Worker part

From Worker to Master:

process.send({ /* json data */ });   // In Worker part

worker.on('message', yourCallbackFunc);    // In Master part

I hope this lets you send and receive data bidirectionally. Please mark this as the answer if you find it useful, so that other users can also find it. Thanks
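As a concrete sketch of how this messaging could serve the question's setup, the master could keep the table in memory and answer query messages from workers, so only matching rows cross the IPC channel. The table contents, row shape, and handler names below are made up for illustration; `load` and `query` from the question would replace them.

```javascript
// Hypothetical stand-in for the master's 12-million-row in-memory table.
const table = [
  { id: 1, city: 'Tokyo' },
  { id: 2, city: 'Paris' },
];

// Master side: wire up with worker.on('message', msg => handleQuery(worker, msg)).
// Runs the query against the in-memory table and sends back only the result.
function handleQuery(worker, msg) {
  const rows = table.filter((r) => r.id === msg.id);
  worker.send({ id: msg.id, rows });
}

// Worker side: wire up with process.on('message', handleResult).
// The worker uses the returned rows to answer its pending HTTP request.
function handleResult(msg) {
  return msg.rows;
}

module.exports = { handleQuery, handleResult };
```

Note that every message is serialized and deserialized on the way through the IPC channel, so this stays cheap only while query results are small; it is message passing, not a shared variable.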

Shivam
  • 1,996
  • 1
  • 16
  • 17
  • The questioner is asking about a "large data with millions of rows". Your answer may not work here. – Mopparthy Ravindranath Oct 31 '16 at 12:07
  • @MupparthyRavindranath ... My answer explains how one can share data between the Master and Worker processes. If it is the db that is creating the problem, then the questioner should try to normalize it as much as possible, or share the query statements / db structure so that we can provide a solution in that direction. – Shivam Nov 02 '16 at 06:59
  • I believe the information is relevant. If the query is done on the master process, it will only send back the relevant data, far less than the full data set. This could work through IPC. That data will need to be sent via HTTP anyway, so IPC won't be the bottleneck. Suggesting other databases is strange since it's quite clear the OP is describing the master as a database system. – Shane Hsu Mar 02 '17 at 00:37
  • 1
    This isn't a "shared variable", it's an entirely new copy of data already stored in memory, which defeats the purpose of being able to access the same location in memory from another worker. The distinction has important implications for how much RAM you need. Additionally, this is terribly inefficient because the data goes through the JSON.parse() and JSON.stringify() methods, both of which block the event loop... – de Raad Oct 06 '17 at 08:18
  • @Shivam something like: https://github.com/jxcore/jxcore or https://github.com/SyntheticSemantics/ems – de Raad Oct 07 '17 at 02:04
9

You are looking for shared memory, which node.js just does not support. You should look for alternatives, such as querying a database or using memcached.

Martin Blech
  • 11,657
  • 6
  • 28
  • 33
  • 3
    There are very many node.js npm modules and some of them do support shared memory, e.g. https://www.npmjs.org/search?q=shared+memory – simonhf Apr 13 '14 at 18:23
  • Almost 4 years later.. @Martin Blech I got a [question for you](http://stackoverflow.com/questions/32400108/using-tcp-for-memory-sharing-across-processes)! – NiCk Newman Sep 04 '15 at 14:37
  • **VOTE HERE:** https://github.com/nodejs/help/issues/560 . It's because no one is voting that it's **still** not implemented yet. – Pacerier Jul 21 '17 at 14:51
6

In node.js, fork does not work like fork() in C++. It does not copy the current state of the process; it starts a new process. So in this case variables are not shared. Every line of code runs in every process, but the master process has the cluster.isMaster flag set to true. You need to load your data in every worker process. Be careful if your data is really huge, because every process will hold its own copy. I think you should query only the parts of the data you need at any moment, or accept the load time if you really need it all in memory.

Vadim Baryshev
  • 22,958
  • 4
  • 51
  • 46
6

If read-only access is fine for your application, try out my own shared memory module. It uses mmap under the covers, so data is loaded as it's accessed and not all at once. The memory is shared among all processes on the machine. Using it is super easy:

const Shared = require('mmap-object')

const shared_object = new Shared.Open('table_file')

console.log(shared_object.property)

It gives you a regular object interface to a key-value store of strings or numbers. It's super fast in my applications.

There is also an experimental read-write version of the module available for testing.

Allen Luce
  • 6,442
  • 2
  • 33
  • 47
  • [A contributor](https://github.com/druide) added bits to get it compiling under MSVS a while back. I haven't tested it recently and don't have handy access to a Windows build environment. – Allen Luce Jul 21 '17 at 21:28
2

You can use Redis.

Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.

redis.io

Reza Roshan
  • 152
  • 3
  • 5
  • 1
    Is this even gonna work? Wouldn't you still need to pass data from Redis to Node, effectively defeating the purpose of shared memory? – Pacerier Jul 21 '17 at 14:53
  • Yes, it works perfectly. You can get data from Redis anywhere you need it in your node code. – Reza Roshan Jul 24 '17 at 08:45
  • 2
    No no, I mean: don't you need to make a **copy**? If you do, then it's no longer true shared memory. – Pacerier Aug 06 '17 at 22:15