30

I have a number of HDF5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all of the datasets separately (i.e., not to concatenate the datasets into a single dataset).

One way to do this is to create an HDF5 file and then copy the datasets one by one. This would be slow and complicated because the copy would need to be buffered.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using Python/h5py.

Bitwise
  • Looks like this was answered already: http://stackoverflow.com/questions/5346589/concatenate-a-large-number-of-hdf5-files – Matt Pavelle Aug 28 '13 at 15:34
  • @MattPavelle as far as I understand this is different from what I want. I do not want to concatenate the datasets into a single dataset, but to keep them as separate datasets within one file. – Bitwise Aug 28 '13 at 15:36
  • Got it, thanks for the clarification and the edit. And forgive the follow-up - it's been a few years since I played with HDF5 - but I assume h5merge doesn't do the trick? – Matt Pavelle Aug 28 '13 at 15:46
  • @MattPavelle Not sure, looking at it now. h5merge does not seem to be part of the official hdf5 tools, and the documentation for it seems kind of poor. I was looking more for a python/h5py solution, but I will also further explore the available hdf5 unix tools. Thanks. – Bitwise Aug 28 '13 at 15:54
  • yeah, it is not an official hdf5 tool - and it's definitely not Pythonic :) but it might be your best bet. – Matt Pavelle Aug 28 '13 at 16:07

6 Answers

36

This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:

External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names.

Here's how to do it in h5py:

myfile = h5py.File('foo.hdf5', 'a')  # 'a': append to the file if it already exists
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")

Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.

This would be much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent - that is, h5py would see all the datasets as residing in foo.hdf5.
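
For the question's case of many single-dataset files, a minimal sketch along these lines (the file names here are hypothetical) could be:

import h5py

# Hypothetical source files, each holding a single dataset at its root.
source_files = ['data1.hdf5', 'data2.hdf5', 'data3.hdf5']

# 'a' appends to foo.hdf5 if it already exists; 'w' would wipe it (see the warning above).
with h5py.File('foo.hdf5', 'a') as master:
    for fname in source_files:
        # Link each file's root group under a name derived from the file name;
        # master['data1'] then behaves like a group living inside foo.hdf5.
        master[fname.rsplit('.', 1)[0]] = h5py.ExternalLink(fname, '/')

Note that the container file stays tiny: the data remain in the original files, which must stay at the linked paths for the links to resolve.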

Yossarian
  • Thanks, that's a nice trick. In my case, though, I prefer them to really be contained in one file. But I might use this method if copying proves to be too slow. – Bitwise Aug 30 '13 at 12:52
  • this should be selected as the answer to the question. – ivotron Jun 03 '15 at 22:19
  • If you are going to do this and you have a lot of links, be sure to use H5Pset_libver_bounds() in C or libver='latest' when creating/opening new files in h5py. This will use the latest file format, which is much more efficient for storing large numbers of links. – Dana Robinson Sep 27 '17 at 12:03
  • I just tried your advice and it turned "myfile", which was a 98.5GB dataset, into 960 bytes; now I have to recreate it. No warnings or anything - poof - 98.5GB gone! – Jeshua Lacock Mar 09 '18 at 22:17
  • @JeshuaLacock that happens when you open it with `'w'`. I think the answer should be `myfile = h5py.File('foo.hdf5', 'a')`. – Yamaneko Sep 30 '18 at 06:14
14

One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:

In [1]: import h5py as h5

In [2]: hf1 = h5.File("f1.h5")

In [3]: hf2 = h5.File("f2.h5")

In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">

In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>

In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">

In [7]: hf1.flush()

In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")

In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")

In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]

In [11]: hf2.get("newval").value
Out[11]: 35

In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]

In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'

The above was generated with h5py version 2.0.1-2+b1 and IPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.

The hf1.flush() in command [7] is crucial, as the low-level interface apparently always draws from the version of the .h5 file stored on disk, not the one cached in memory. Copying datasets to/from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
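
In the transcript's terms, copying val2 out of group g1 rather than from the file root would look like this (the destination name val2_copy is hypothetical):

h5.h5o.copy(hf1.get("g1").id, b"val2", hf2.id, b"val2_copy")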

Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
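
Applied to the original question of merging many single-dataset files, a rough sketch using this low-level call (file names hypothetical; note the bytes names for Python 3) might be:

import h5py

# Hypothetical source files, each holding one dataset at its root.
sources = ['f1.h5', 'f2.h5', 'f3.h5']

with h5py.File('combined.h5', 'w') as dest:
    for fname in sources:
        with h5py.File(fname, 'r') as src:
            for name in src:  # iterate over root-level object names
                # The low-level API wants bytes on Python 3, hence .encode();
                # this raises (no clobber) if two files share a dataset name.
                h5py.h5o.copy(src.id, name.encode(), dest.id, name.encode())

Since H5Ocopy does the copying inside the HDF5 library, the data never pass through Python, so no manual buffering is needed even for datasets larger than RAM.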

hBy2Py
  • This looks to be potentially a couple of years too late, but... I'll definitely be using it, and hopefully if nothing else it'll help someone else, too. – hBy2Py Jun 03 '15 at 03:48
  • Thanks! Actually this question gets votes every now and then, so I am guessing it is still useful for many people. – Bitwise Jun 03 '15 at 12:38
  • Cool. HDF5 is a really nice data format, but its high-level API is far from... exhaustive. – hBy2Py Jun 03 '15 at 13:13
  • I'm using `h5py` 2.7.1 and python 3.6.5 and the strings needed to be bytes, so replace: `h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")` with `h5.h5o.copy(hf1.id, b"g1", hf2.id, b"newg1")` – salotz Jun 18 '18 at 21:18
11

I found a non-Python solution using h5copy from the official HDF5 tools. h5copy can copy individual specified datasets from one HDF5 file into another existing HDF5 file.
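
For reference, a typical h5copy invocation copies one named object into the output file; wrapped in Python via subprocess, with hypothetical file and dataset names, it might look like:

import subprocess

# Copy dataset /mydata from in.h5 into out.h5 (names are hypothetical).
# Adding -p would also create missing intermediate groups in the destination.
subprocess.run(['h5copy', '-i', 'in.h5', '-o', 'out.h5',
                '-s', '/mydata', '-d', '/mydata'], check=True)

Repeating this per dataset and per input file gives the merge the question asks for.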

If someone finds a Python/h5py-based solution, I would be glad to hear about it.

Bitwise
2

I usually use IPython and the h5copy tool together; this is much faster than a pure Python solution. Once h5copy is installed:

Console solution (minimal working example)

# PLEASE NOTE: THIS IS IPYTHON CONSOLE CODE, NOT PURE PYTHON

import h5py
# For every dataset Dn.h5 you want to merge into Output.h5:
f = h5py.File('D1.h5', 'r')  # file to be merged (read access is enough here)
h5_keys = list(f.keys())     # list() so the keys survive closing the file
f.close()                    # close the file
for i in h5_keys:
    !h5copy -i 'D1.h5' -o 'Output.h5' -s {i} -d {i}

Automated console solution

To fully automate the process, supposing you are working in the folder where the files to be merged are stored:

import os
import h5py

# Only consider .h5 files, so h5py does not choke on other files in the folder
d_names = [n for n in os.listdir(os.getcwd()) if n.endswith('.h5')]
d_struct = {}  # here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r')
    d_struct[i] = list(f.keys())  # list() so the keys survive closing the file
    f.close()

# A) Copy every dataset into the root of the new output.h5 file
for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}

Create a new group for every .h5 file added

If you want to keep each input file's datasets separate inside output.h5, you have to create the group first using the flag -p:

# B) Create a new group in the output.h5 file for every input .h5 file
for i in d_names:
    group = i[:-3]  # group name: the file name without its .h5 extension
    dataset = d_struct[i][0]
    newgroup = '%s/%s' % (group, dataset)
    !h5copy -i '{i}' -o 'output.h5' -s {dataset} -d {newgroup} -p
    for j in d_struct[i][1:]:
        newgroup = '%s/%s' % (group, j)
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {newgroup}
G M
2

As an update: HDF5 version 1.10 introduced a new feature that might be useful in this context, called "Virtual Datasets".
Here you can find a brief tutorial and some explanations: Virtual Datasets.
Here are more complete and detailed explanations and documentation for the feature:
Virtual Datasets extra doc.
And here is the merged pull request that includes the virtual datasets API in h5py:
h5py Virtual Datasets PR. I don't know whether it's already available in the current h5py version or will only come later.
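
One caveat, echoed in the comment below: a virtual dataset maps its sources into one combined dataset, so it virtually concatenates them rather than keeping them separate. A minimal sketch of the high-level API that h5py eventually gained (file and dataset names hypothetical; requires HDF5 >= 1.10 and an h5py release with VDS support):

import h5py

# Hypothetical: four files, each with a 1-D float64 dataset 'data' of length 100.
layout = h5py.VirtualLayout(shape=(4, 100), dtype='f8')
for i, fname in enumerate(['a.h5', 'b.h5', 'c.h5', 'd.h5']):
    # Map each source dataset onto one row of the virtual dataset.
    layout[i] = h5py.VirtualSource(fname, 'data', shape=(100,))

with h5py.File('vds.h5', 'w') as f:
    f.create_virtual_dataset('combined', layout, fillvalue=0)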

fedepad
  • Creating a virtual dataset would (virtually) concatenate the datasets, though, which is not what the original poster wanted to do. – Dana Robinson Sep 27 '17 at 12:07
0

To use Python (and not IPython) and h5copy to merge HDF5 files, we can build on G M's answer:

import h5py
import os

# Only consider .h5 files, so h5py does not choke on other files in the folder
d_names = [n for n in os.listdir(os.getcwd()) if n.endswith('.h5')]
d_struct = {}  # here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r')
    d_struct[i] = list(f.keys())  # list() so the keys survive closing the file
    f.close()

for i in d_names:
    for j in d_struct[i]:
        os.system('h5copy -i %s -o output.h5 -s %s -d %s' % (i, j, j))
zilba25