
What's the best way to exchange moderately large amounts of data (multiple megabytes, but not gigabytes) between UNIX processes?

I think it would be memory-mapped files, since the size limitations seem tolerable enough.

I need bidirectional communication, so plain pipes won't help. And with datagram sockets (UDP) there are per-packet size limitations, as far as I know (also see here). I'm not sure whether TCP would be a good idea for communicating between the child and parent processes of a fork().

In related questions such as this one, some people have recommended shared memory / mmap and others have recommended sockets.

Is there something else I should look into? For example, is there some higher-level library that helps with IPC by providing, e.g., XML serialization/deserialization of the data?

Edit due to comments:

In my particular case, there is a parent/controller process and several children (threads can't be used). The controller provides the children, on request, with some key data that would probably fit into a single UDP packet. The children act on the key data and provide the controller with information based on the keys (the size of that information can be 10-100 MB).

Issues: the size of the response data, a mechanism for informing the parent of a key request, and synchronization: the parent has to delete a key from its list after passing it to a child, and no duplicate key processing should occur.

Boost and other third-party libraries (unfortunately) must not be used. I might be able to use libraries provided by the SunOS 5.10 system.

  • You could have two pipes, one for each direction. – Basile Starynkevitch Jul 04 '13 at 05:38
  • What counts as 'large size'? Megabytes, gigabytes, terabytes, larger? Plain files or memory mapped files are plausible candidates... – Jonathan Leffler Jul 04 '13 at 05:38
  • http://en.wikipedia.org/wiki/Shared_memory#Support_on_UNIX_platforms – Jim Balter Jul 04 '13 at 05:42
  • "large size" -> more than a few megs, but somewhat smaller that 1 GB. – msi Jul 04 '13 at 05:44
  • Probably using JSON (instead of XML) or some binary serialization format (XDR) is faster than XML – Basile Starynkevitch Jul 04 '13 at 05:45
  • Oh — baby big stuff, then. :D How much of the data will be modified at a time? Do you need to keep a record between runs of the program? Is the write pattern random? Is the read pattern random? Is the read pattern correlated with the write pattern (are you going to treat it as a humungous circular buffer, or could the reader be reading all over the place while the writer is writing to other places)? – Jonathan Leffler Jul 04 '13 at 05:46
  • Modification: unknown - worst case: all of it. No records between runs, random read/write patterns (uncorrelated, but most of the time, once process A has written into the e.g. mmap, process B will take over the data for 99% of its lifetime - so little semaphoring should occur). – msi Jul 04 '13 at 05:54
  • With no need for persistence, I'd probably use shared memory rather than memory-mapped files (since files provide persistence). You'll just have to be careful to coordinate access to any given area between reader and writer. You said the data flow was bidirectional, so you'll potentially need mechanisms for A to notify B when (and where) there is new data, and for B to notify A when (and where) there is new data. – Jonathan Leffler Jul 04 '13 at 05:58

3 Answers


Sockets. You don't have to protect the memory you're reading or writing with a lock or some such to make it safe for parallel execution. An additional benefit is that you can very easily split your code into two separate executables and run them on different machines, reusing the socket communication code for the network communication, too.
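
For the parent/child setup you describe, a `socketpair()` gives you a bidirectional channel with no addressing or cleanup to worry about. A rough, untested sketch (the key strings and buffer sizes are just placeholders, not from your setup):

```c
/* Sketch: bidirectional parent/child channel via socketpair(). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    /* AF_UNIX stream socket pair: both ends are readable and writable. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                               /* child */
        close(sv[0]);
        char key[64];
        ssize_t n = read(sv[1], key, sizeof key); /* receive key data */
        if (n > 0)
            write(sv[1], "result for key", 15);   /* send response back */
        close(sv[1]);
        _exit(0);
    }

    /* parent */
    close(sv[1]);
    write(sv[0], "key-123", 8);                   /* hand a key to the child */
    char buf[128];
    ssize_t n = read(sv[0], buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("parent got: %s\n", buf);
    }
    close(sv[0]);
    return 0;
}
```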

In my eyes, the major downside is that you'll have to come up with a scheme for (de-)serializing your data and for splitting/assembling the data chunks into/from several messages.
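
For the framing part, a common approach over a stream socket is a simple length prefix in front of each message; something along these lines (the `send_msg`/`recv_msg` helpers are purely illustrative, not part of any library):

```c
/* Sketch of length-prefixed message framing over a stream socket. */
#include <arpa/inet.h>   /* htonl/ntohl */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Frame: 4-byte big-endian length, then the payload. */
int send_msg(int fd, const void *data, uint32_t len)
{
    uint32_t nlen = htonl(len);
    if (write_all(fd, &nlen, sizeof nlen) < 0) return -1;
    return write_all(fd, data, len);
}

/* Caller frees *data on success. */
int recv_msg(int fd, void **data, uint32_t *len)
{
    uint32_t nlen;
    if (read_all(fd, &nlen, sizeof nlen) < 0) return -1;
    *len = ntohl(nlen);
    *data = malloc(*len);
    if (*data == NULL) return -1;
    if (read_all(fd, *data, *len) < 0) { free(*data); return -1; }
    return 0;
}
```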

arne
  • I think the serialization/deserialization burden can be alleviated with `boost::serialization`. This is used in `boost::mpi`, which *could* also provide a solution to OP's problem. – juanchopanza Jul 04 '13 at 05:49
  • @juanchopanza You're right. I haven't used that yet, so I didn't think of it. – arne Jul 04 '13 at 05:50
  • I suppose it depends on what you mean by a large dataset and how quickly you need it. I would consider socket interfaces to be too slow and require too much request/response latency. As well as potentially not being able to keep the entire dataset in memory 2-3 times simultaneously and having to re-stream data because you had to discard it previously. – xaxxon Jul 04 '13 at 05:53
  • Besides serialization: Isn't it possible to lose packets when using UDP? So, when using sockets, would you recommend TCP? – msi Jul 04 '13 at 07:03
  • @bcml: If you want to transfer your data over the net, TCP would be the way to go. Otherwise, use Unix Domain sockets (https://en.wikipedia.org/wiki/Unix_domain_socket) – arne Jul 04 '13 at 08:14

Shared memory is fast, but leaves all the burden of coordinating access to the memory on you. You'll probably need to set up (at least) a mutex to assure that only one of the processes writes to the shared memory at any given time (and, obviously enough, ensure both processes use that mutex correctly).

Along with that, you'll (again, probably) need to set up some structures in the memory so a receiving process knows what new data has been written, where it resides, etc.
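
A minimal sketch of what that coordination might look like with POSIX shared memory, a process-shared mutex, and a small header describing the payload (the segment name, struct layout, and sizes are invented for illustration; on Solaris you'd link with -lrt and -lpthread):

```c
/* Sketch: POSIX shared memory with a process-shared mutex and a header
 * so the reader knows how much data is valid. Names are made up. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_shm"
#define DATA_CAPACITY (64 * 1024 * 1024)   /* 64 MB payload area */

struct shared_block {
    pthread_mutex_t lock;   /* must be PTHREAD_PROCESS_SHARED */
    size_t          length; /* how many bytes of data[] are valid */
    int             ready;  /* set by writer, cleared by reader */
    char            data[DATA_CAPACITY];
};

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(struct shared_block)) == -1) {
        perror("ftruncate"); return 1;
    }

    struct shared_block *blk = mmap(NULL, sizeof(struct shared_block),
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
    if (blk == MAP_FAILED) { perror("mmap"); return 1; }

    /* The creating process initializes the mutex as process-shared. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&blk->lock, &attr);

    /* Writer side: publish some data under the lock. */
    pthread_mutex_lock(&blk->lock);
    const char msg[] = "result payload";
    memcpy(blk->data, msg, sizeof msg);
    blk->length = sizeof msg;
    blk->ready = 1;
    pthread_mutex_unlock(&blk->lock);

    munmap(blk, sizeof(struct shared_block));
    close(fd);
    /* shm_unlink(SHM_NAME) once both sides are done. */
    return 0;
}
```

The other process opens the same name with `shm_open`, maps it, and takes the same lock before reading `length` and `data`.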

Sockets limit the amount of data that can be sent in an individual packet, but not the total amount you can send overall. Also note that Unix domain sockets are basically shared memory, with all the coordination handled in the kernel, so it's generally quite fast.

There are quite a few libraries/protocols for IPC serialization -- Boost Serialization, Sun XDR, Google protocol buffers, etc. Given that you're exchanging quite a bit of data, I'd tend to lean toward something other than XML. XML encoding and parsing tend to be relatively slow, and often expand the data quite a bit as well.
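
Since the question is restricted to what ships with SunOS 5.10, Sun XDR is readily available via the RPC headers. A rough sketch of encoding and decoding into a memory buffer (the record layout is invented for illustration):

```c
/* Sketch of XDR encoding/decoding with the Sun RPC library (<rpc/xdr.h>). */
#include <rpc/xdr.h>
#include <stdio.h>

int main(void)
{
    char buf[256];
    XDR enc, dec;

    /* Encode an int and a string into a portable byte stream. */
    xdrmem_create(&enc, buf, sizeof buf, XDR_ENCODE);
    int key = 42;
    char *name = "some-key";
    if (!xdr_int(&enc, &key) || !xdr_string(&enc, &name, 64)) {
        fprintf(stderr, "encode failed\n");
        return 1;
    }
    unsigned used = xdr_getpos(&enc);

    /* Decode from the same buffer (normally received over IPC). */
    xdrmem_create(&dec, buf, used, XDR_DECODE);
    int key_out = 0;
    char name_buf[64];
    char *name_out = name_buf;
    if (!xdr_int(&dec, &key_out) ||
        !xdr_string(&dec, &name_out, sizeof name_buf)) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }
    printf("decoded: %d %s\n", key_out, name_out);
    return 0;
}
```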

Jerry Coffin

Shared memory. You can read/write it extremely quickly, and it's always available to both sides; no request/response round trip is required.

http://en.wikipedia.org/wiki/Shared_memory

More importantly, you only have to keep one copy of your dataset, so you don't end up with 2 copies plus data in flight - and if you can't afford 2 copies, you're likely to have to stream the same data multiple times with other solutions.
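
Since the children are fork()ed from the controller, an anonymous shared mapping created before the fork is the simplest way to get this: both processes then see the same pages without any naming or unlinking. A rough sketch (the sizes and strings are illustrative; on older systems you'd map /dev/zero instead of using MAP_ANON):

```c
/* Sketch: anonymous shared mapping set up before fork(), so parent and
 * child read and write the same pages. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t size = 100UL * 1024 * 1024;   /* room for a ~100 MB response */
    char *region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANON, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: write the response */
        strcpy(region, "child wrote its result here");
        _exit(0);
    }

    waitpid(pid, NULL, 0);                /* parent: read it afterwards */
    printf("parent sees: %s\n", region);
    munmap(region, size);
    return 0;
}
```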

xaxxon
  • But if you have two processes modifying shared memory, you must coordinate access lest you run into conflicts as the processes modify the same memory at the same time. – Jonathan Leffler Jul 04 '13 at 05:44
  • Of course you do. But you can share very large datasets at exceptionally high speeds. It's worth the tradeoff if you KNOW it's always going to be on the same box. – xaxxon Jul 04 '13 at 05:52