
I'm trying to recreate an HDF5 file using h5py that stores binary data (e.g. JPEG compressed images) as an OPAQUE dataset using the tag to store the MIME type so they can be easily decoded later.

The only way I've been able to solve this is using the low-level API, but it would be nice if there was something higher level!

(I'm posting my solution as an answer in the hope it's useful to other people, as I struggled to find many examples of doing this sort of thing)

Sam Mason
  • I'm curious - why are you saving as an Opaque dataset? Image data examples I've seen save the images as np.array of various dtypes and shapes. – kcw78 Sep 23 '20 at 17:35
  • @kcw78 JPEG encoding is to save space, and the opaque datatype seemed appropriate. Expecting to archive ~100k files, each file having ~500 images + related camera/experiment info. – Sam Mason Sep 23 '20 at 19:02
  • @kcw78 might have misinterpreted; the format was designed for writing by C++ code (which will produce most of the files). I wanted some Python code to do the same for testing, and posted the question because it seemed more awkward than I expected. – Sam Mason Sep 23 '20 at 19:20

2 Answers


An easier way to solve this issue, avoiding the low-level API, is HDFql. In Python with HDFql, it could be solved as follows:

# import HDFql package
import HDFql

# get size (in bytes) of file 'input.jpeg'
HDFql.execute("SHOW FILE SIZE input.jpeg")

# move cursor to first element
HDFql.cursor_first()

# get cursor element and assign it to variable
input_size = HDFql.cursor_get_unsigned_bigint()

# create HDF5 file 'output.h5'
HDFql.execute("CREATE FILE output.h5")

# create dataset 'mydata' (in file 'output.h5') of data type opaque with a tag 'image/jpeg' and storing the content of file 'input.jpeg'
HDFql.execute("CREATE DATASET output.h5 mydata AS OPAQUE(%d) TAG image/jpeg VALUES FROM BINARY FILE input.jpeg" % input_size)
SOG
  • not heard of HDFql before, looks interesting! is there a way to do this without using the local file system? AFAICT I could use `variable_transient_register` – Sam Mason Sep 24 '20 at 10:19
  • @SamMason: not sure what you mean by local file system. Concerning the usage of `variable_transient_register`, yes, you could use it if you have a NumPy array filled with the content that you want to store in dataset `mydata`. You could then do the following instead: `HDFql.execute("CREATE DATASET output.h5 mydata AS OPAQUE(%d) TAG image/jpeg VALUES FROM MEMORY %d" % (input_size, HDFql.variable_transient_register(my_numpy_array)))`. – SOG Sep 24 '20 at 15:34
  • yup, sorry for the ambiguity, you interpreted correctly. is it valid to pass `bytes` in? I'm struggling to find a repo (e.g. github) for HDFql so I can check myself, any pointers? – Sam Mason Sep 24 '20 at 15:59

The only way I've found of doing this is using the low-level API. This means we need to set up the datatype and dataspace ourselves before we can create the dataset and write the data in.

import h5py
import numpy as np

# get the binary data in
with open('input.jpeg', 'rb') as fd:
  data = fd.read()

# set up an HDF5 type appropriately sized for our data
dtype = h5py.h5t.create(h5py.h5t.OPAQUE, len(data))
dtype.set_tag(b'image/jpeg')

# set up a simple scalar HDF5 data space
space = h5py.h5s.create(h5py.h5s.SCALAR)

with h5py.File('output.h5', 'w') as root:
  ds = h5py.h5d.create(root.id, b'mydata', dtype, space)

  ds.write(space, space, np.frombuffer(data, dtype=np.uint8), dtype)

This works for me, with `h5dump -H output.h5` giving:

HDF5 "output.h5" {
GROUP "/" {
   DATASET "mydata" {
      DATATYPE  H5T_OPAQUE {
         OPAQUE_TAG "image/jpeg";
      }
      DATASPACE  SCALAR
   }
}
}

but it would be nice if this was a bit easier!
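For completeness, reading the bytes back out is also a low-level affair: open the dataset, inspect its opaque tag to learn the MIME type, and read the raw bytes into a uint8 buffer. This is a sketch of a full round trip under the same approach as above — the `roundtrip.h5` filename and the fake JPEG bytes are just for illustration:

```python
import h5py
import numpy as np

# stand-in payload; in practice this would be real JPEG bytes
data = b'\xff\xd8\xff\xe0not-really-a-jpeg'

# write an opaque scalar dataset, exactly as in the answer above
dtype = h5py.h5t.create(h5py.h5t.OPAQUE, len(data))
dtype.set_tag(b'image/jpeg')
space = h5py.h5s.create(h5py.h5s.SCALAR)
with h5py.File('roundtrip.h5', 'w') as root:
    ds = h5py.h5d.create(root.id, b'mydata', dtype, space)
    ds.write(space, space, np.frombuffer(data, dtype=np.uint8), dtype)

# read it back: the tag tells us how to decode the payload
with h5py.File('roundtrip.h5', 'r') as root:
    ds = h5py.h5d.open(root.id, b'mydata')
    file_type = ds.get_type()
    tag = file_type.get_tag()
    buf = np.empty(file_type.get_size(), dtype=np.uint8)
    ds.read(h5py.h5s.ALL, h5py.h5s.ALL, buf, file_type)
    payload = buf.tobytes()
```

After this, `tag` should be `b'image/jpeg'` and `payload` should equal the original bytes, ready to hand to an image decoder.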

Sam Mason