Write a collection to a binary file

Question

I'm looking for a way to write some data (List, Array, etc.) to a binary file. The collection to write represents a list of points. What I have tried so far:
Welcome to Scala 2.12.0-M3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40).
Type in expressions for evaluation. Or try :help.

scala> import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

scala> val oos = new ObjectOutputStream(new FileOutputStream("/tmp/f1.data"))
oos: java.io.ObjectOutputStream = java.io.ObjectOutputStream@13bc8645

scala> oos.writeObject(List(1,2,3,4))

scala> oos.close

scala> val ois = new ObjectInputStream(new FileInputStream("/tmp/f1.data"))
ois: java.io.ObjectInputStream = java.io.ObjectInputStream@392a04e7

scala> val s : List[Int] = ois.readObject().asInstanceOf[List[Int]]
s: List[Int] = List(1, 2, 3, 4)

OK, it works well. The problem is that tomorrow I may need to read this binary file from another language such as Python. Is there a way to produce a more generic binary file that can be read from multiple languages?

Solution

For anyone in the same situation, you can do it like this:

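import java.io.RandomAccessFile
import java.nio.ByteBuffer

def write2binFile(filename: String, a: Array[Int]): Unit = {
  // Note: "rw" mode does not truncate a file that already exists.
  val channel = new RandomAccessFile(filename, "rw").getChannel
  try {
    // 4 bytes per Int; a ByteBuffer is big-endian by default.
    val bbuffer = ByteBuffer.allocateDirect(a.length * 4)
    // The Int view has its own position, so bbuffer still spans all bytes.
    bbuffer.asIntBuffer().put(a)
    while (bbuffer.hasRemaining) channel.write(bbuffer)
  } finally channel.close()
}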

What's the reason you want to use binary? Is it to make the files smaller or to hide their content? Off the top of my head, BSON sounds like a format you might consider. Not sure about library support in Scala/Python but there's gotta be something as MongoDB uses this format. — toniedzwiedz, May 03 '16 at 09:25
The reason is that I want to store the binary files on Amazon S3 and use the aws java API to read a specific range of bytes. So the binary files must not be specific to Java and it's the main reason of my post. — alifirat, May 03 '16 at 09:33
That depends on the binary format. Based on your description, you also need some kind of index, or at least a calculated way to know which elements are stored at which byte offset. For cross-language binary serialization I'd suggest taking a look at Google Protocol Buffers. But depending on your use case, you probably need something different to make your byte lookups work. — Elmar Weber, May 03 '16 at 09:37
Voting to close as "too broad". There are many ways of doing this and many formats that might work. It also might be argued to be a duplicate of this question: http://stackoverflow.com/questions/1421707/cross-platform-and-language-deserialization — The Archetypal Paul, May 03 '16 at 10:04
@ElmarWeber, I have one JSON file describing every binary file (number of samples, start time). The chosen format is a sequence of bytes that I can easily extract using the AWS Java API. The main question is whether there is another way to write my data to a binary file. — alifirat, May 03 '16 at 12:11

Jan Vlcinsky · Answer 1 · 2016-05-03T10:42:06.597

Format for cross-platform sharing of point coordinates allowing selective access by RANGE

Your requirements are:

store data by Scala, read by Python (or other languages)
the data are lists of point coordinates
store the data on AWS S3
allow fetching only part of the data using RANGE request

Data structure to use

The data must be uniform in structure and size per element, so that the byte position of any part can be calculated and requested by means of RANGE.

If Scala format for storing lists/arrays fulfils this requirement, and if the binary format is well defined, you may succeed. If not, you have to find another format.
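With a fixed element size, the byte range for any run of elements is simple arithmetic. A minimal sketch, assuming each point is stored as two 4-byte integers (8 bytes per record) and that the file has no header; `byte_range`, `range_header`, `record_size` and `header_size` are hypothetical names chosen for illustration:

```python
def byte_range(first, count, record_size=8, header_size=0):
    """Return the inclusive (start, end) byte offsets of `count` records
    starting at record index `first`.

    Assumption: every record occupies exactly `record_size` bytes.
    """
    if first < 0 or count <= 0:
        raise ValueError("first must be >= 0 and count must be > 0")
    start = header_size + first * record_size
    end = start + count * record_size - 1  # HTTP byte ranges are inclusive
    return start, end


def range_header(first, count, record_size=8, header_size=0):
    """Format the offsets as an HTTP Range header value."""
    start, end = byte_range(first, count, record_size, header_size)
    return f"bytes={start}-{end}"
```

The resulting string is the kind of value an HTTP RANGE request expects; how it is passed to the S3 client depends on the SDK in use.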

Reading binary data by Python

Assuming the format is known, use the struct module from the Python standard library to read it.
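As an illustration, here is a sketch that decodes a file written by the `write2binFile` function from the question. It assumes the file contains nothing but signed 32-bit integers in big-endian order, which is the default byte order of a Java `ByteBuffer`; the helper names `unpack_ints` and `read_ints` are hypothetical:

```python
import struct


def unpack_ints(data):
    """Decode a bytes object holding big-endian signed 32-bit integers."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(data) // 4
    # ">" selects big-endian, "i" is a signed 32-bit integer
    return list(struct.unpack(f">{count}i", data))


def read_ints(path):
    """Read a whole file and decode it as big-endian 32-bit integers."""
    with open(path, "rb") as f:
        return unpack_ints(f.read())
```

The same `unpack_ints` call should also work on the body of a RANGE response, provided the requested range starts and ends on element boundaries.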

Alternative approach: split data to smaller pieces

You want to access the data piece by piece, probably expecting one large object on S3 and using HTTP requests with RANGE.

An alternative solution is to split the data into smaller pieces of a reasonable size for fetching (e.g. 64 kB, but you know your use case better), and to design a rule for storing them piece by piece on AWS S3. You may even use a tree structure for this purpose.

There are some advantages with this approach:

use whatever format you like, e.g. XML, JSON, no need to deal with special binary formats
pieces can be compressed, you will save some costs

Note that AWS S3 charges not only for data transfer but also per request, so each HTTP request using RANGE counts as one request.
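The splitting approach could be sketched as follows. This is only an illustration under stated assumptions: points are `[x, y]` lists, each piece is gzip-compressed JSON, and the key naming rule (`points/` prefix plus the zero-padded index of the first point in the piece) is invented for the example. Uploading the pieces to S3 is left out.

```python
import gzip
import json


def split_into_pieces(points, piece_size=1000, prefix="points/"):
    """Yield (key, payload) pairs, one gzip-compressed JSON piece per
    `piece_size` points. The key encodes the index of the first point."""
    for start in range(0, len(points), piece_size):
        piece = points[start:start + piece_size]
        payload = gzip.compress(json.dumps(piece).encode("utf-8"))
        yield f"{prefix}{start:010d}.json.gz", payload


def key_for_index(index, piece_size=1000, prefix="points/"):
    """Return the key of the piece that contains point number `index`."""
    start = (index // piece_size) * piece_size
    return f"{prefix}{start:010d}.json.gz"


def load_piece(payload):
    """Decode one piece back into a list of points."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Because the key can be computed from the point index, a reader only needs to know `piece_size` and the prefix to fetch the right piece; no separate index file is required.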

Cross-platform binary formats to consider

Consider the following formats:

BSON
(Google) Protocol Buffers
HDF5

If you visit the Wikipedia page for any of those formats, you will find many links to other formats.

Anyway, I am not aware of any such format that uses a uniform size per element, as most of them try to keep the size as small as possible. For this reason they cannot be used in a scenario based on RANGE unless some special index file is introduced (which is probably not very feasible).

On the other hand, using these formats with the alternative approach (splitting the data into smaller pieces) should work.

Note: I did some tests in the past on storage efficiency and on encoding/decoding speed. From a practical point of view, the best results were achieved using a simple JSON structure (possibly compressed). You will find these options on every platform, they are very simple to use, and the encoding/decoding speed is high (I do not say the highest).

I don't think this really answers the question, because of this: "Assuming the format is known", and then this "use Python struct module from stdlib to read it.", given the OP talks about "multiple languages". The "alternative" basically just says "roll your own". — The Archetypal Paul, May 03 '16 at 10:24
@TheArchetypalPaul Thanks for feedback on downvoting. The question is, what is the real question and it often differs in title, in text of the question and after comments clarify what is really behind it. I was addressing the last one. Anyway, I added section to my answer addressing the formats and their usability. — Jan Vlcinsky, May 03 '16 at 10:46