Read HDF5 file into numpy array

Question

I have the following code to read an HDF5 file as a NumPy array:

import h5py
import numpy as np

hf = h5py.File('path/to/file', 'r')
n1 = hf.get('dataset_name')
n2 = np.array(n1)

and when I print n2 I get this:

Out[15]:
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>, <HDF5 object reference>...

How can I read the HDF5 object reference to view the data stored in it?

score 18 · Answer 1 · edited Jul 28 '20 at 09:44

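The easiest thing is to use the .value attribute of the HDF5 dataset. Note that .value is deprecated and was removed in h5py 3.0, so on current versions index the dataset with [()] instead (see Note 2).

>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value # `data` is now an ndarray (h5py < 3.0 only).
>>> data = hf['dataset_name'][()] # same result, works on current h5py as well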

You can also slice the dataset, which produces an actual ndarray with the requested data:

>>> hf['dataset_name'][:10] # produces ndarray as well

But keep in mind that in many ways the h5py dataset acts like an ndarray. So you can pass the dataset itself unchanged to most, if not all, NumPy functions. So, for example, this works just fine: np.mean(hf.get('dataset_name')).
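As a minimal sketch of the two points above (slicing yields an ndarray, and the dataset can be handed to a NumPy function directly), the following builds a throwaway file first so that it runs on its own. The file name `example.h5` and the dataset contents are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset_name", data=np.arange(20, dtype="f8"))

with h5py.File(path, "r") as hf:
    dset = hf["dataset_name"]      # an h5py Dataset; nothing is read yet
    first_ten = dset[:10]          # slicing reads from disk into an ndarray
    mean_value = np.mean(dset)     # NumPy converts the dataset internally

print(type(first_ten).__name__, first_ten.shape)
print(mean_value)
```

Both reads happen inside the `with` block because the dataset becomes unusable once the file is closed; the resulting ndarray and float stay valid afterwards.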

EDIT:

I misunderstood the question originally. The problem isn't loading the numerical data, it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.

First, let's create a file and a temporary dataset:

>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))

Next, create a reference to it and store a few of them in a dataset.

>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)

Then you can read one of these back, in a circuitous way, by getting its name and then reading from the dataset that is actually referenced.

>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)

It's round-about, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.

Note:

The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example here). Getting the name and reading from that is much easier.

Note 2:

As stated in the release notes for h5py 2.1, the use of Dataset.value property is deprecated and should be replaced by using mydataset[...] or mydataset[()] as appropriate.

The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
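A small sketch of the replacement the release notes describe, assuming a made-up file `value_demo.h5` with one 2D dataset and one scalar dataset (both names are hypothetical). For array datasets `[...]` and `[()]` both load everything; `[()]` is the form that also works for scalar datasets.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "value_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.arange(6).reshape(2, 3))
    f.create_dataset("scalar", data=42)

with h5py.File(path, "r") as f:
    matrix = f["matrix"][...]       # whole dataset as an ndarray
    also_matrix = f["matrix"][()]   # same data, alternative spelling
    scalar = f["scalar"][()]        # the way to read a scalar dataset

print(matrix.shape)
print(scalar)
```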

I'm trying that but I still get the same `HDF5 object reference` when I print the `data` variable — e9e9s, Oct 13 '17 at 15:35
Ahh, I think I know what's going on. The dataset you're trying to load is actually made up of HDF5 references. It's not numerical data. You can verify this by doing `h5ls` or `h5dump` on the file. In this case, I don't know how you can read from the referenced dataset in `h5py`. — bnaecker, Oct 13 '17 at 15:38
It looks like you can use the `h5py.h5r` module to dereference the dataset. Can you try: `h5py.h5r.dereference(hf['dataset_name'])`? — bnaecker, Oct 13 '17 at 15:42
When I try that, I get this error message `TypeError: dereference() takes exactly 2 positional arguments (1 given)` — e9e9s, Oct 13 '17 at 15:46
When I list the keys by doing `with h5py.File('path/to/file', 'r') as hdf: ls = list(hdf.keys()) print('List of datasets in this file: \n', ls)` I get `List of datasets in this file: ['#refs#', 'data_set']` Not sure if this would help or not — e9e9s, Oct 13 '17 at 15:48
I'm still unable to read the `HDF5 object reference`. Sorry but I can't see how your example is relevant to my question. I've been trying for days and will post an answer here so others don't have to go through this. — e9e9s, Oct 17 '17 at 14:08
You can use the name of the referenced object to access it directly in the file, which is the relevance of my answer to your question. But Pierre de Buyl's answer is actually easier. — bnaecker, Oct 18 '17 at 05:28
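The comments suggest checking with `h5ls` or `h5dump` whether a dataset holds references. A rough Python-side equivalent, sketched here under assumptions, is to inspect the dataset's dtype with `h5py.check_dtype(ref=...)`. The helper name `holds_references` and the file `ref_check.h5` are hypothetical and exist only for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with one numeric dataset and one dataset of references.
path = os.path.join(tempfile.mkdtemp(), "ref_check.h5")
with h5py.File(path, "w") as f:
    target = f.create_dataset("data", data=np.zeros(10))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    refs = f.create_dataset("data_refs", shape=(2,), dtype=ref_dtype)
    refs[0] = target.ref
    refs[1] = target.ref


def holds_references(dset):
    """Return True if the dataset's dtype is an HDF5 reference type."""
    # check_dtype returns the reference class, or None for other dtypes.
    return h5py.check_dtype(ref=dset.dtype) is not None


with h5py.File(path, "r") as f:
    plain = holds_references(f["data"])
    referenced = holds_references(f["data_refs"])

print(plain, referenced)
```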

score 8 · Answer 2 · answered Feb 02 '18 at 02:05

Here is a direct approach to read an HDF5 file as a NumPy array:

import numpy as np
import h5py

hf = h5py.File('path/to/file.h5', 'r')
n1 = hf["dataset_name"][:]  # slicing already returns an ndarray; dataset_name is the HDF5 object name

print(n1)

spate

ArcherEX · Answer 3 · 2018-06-11T14:50:55.907

h5py provides a built-in method for such tasks: read_direct()

hf = h5py.File('path/to/file', 'r')
# `shape` and `numpy_type` are placeholders; they must match the dataset
n1 = np.zeros(shape, dtype=numpy_type)
hf['dataset_name'].read_direct(n1)  # fills n1 in place
hf.close()

The combined steps are still faster than n1 = np.array(hf['dataset_name']) if you %timeit them. The only drawback is that one needs to know the shape of the dataset beforehand; the data provider can store it as an attribute.
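If no shape attribute has been provided, one possible workaround is to take the shape and dtype from the dataset object itself before allocating the buffer. This is only a sketch under that assumption, not a claim about the setup the answer had in mind; the file `direct.h5` and its contents are invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, written here only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "direct.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("dataset_name", data=np.arange(12, dtype="i4").reshape(3, 4))

with h5py.File(path, "r") as hf:
    dset = hf["dataset_name"]
    # Shape and dtype are stored in the file, so they can be queried
    # without reading any of the data.
    n1 = np.empty(dset.shape, dtype=dset.dtype)
    dset.read_direct(n1)   # copies the data straight into n1

print(n1.shape, n1.dtype)
```

`np.empty` is used instead of `np.zeros` because every element is overwritten by `read_direct` anyway.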

Pierre de Buyl · Answer 4 · 2017-10-17T19:13:15.877

HDF5 has a simple object model for storing datasets (roughly speaking, the equivalent of an "on file array") and organizing those into groups (think of directories). On top of these two object types, there are much more powerful features that require layers of understanding.

The one at hand is a "Reference". It is an internal address in the storage model of HDF5.

h5py will do all the work for you without any calls to obscure routines, as it tries to follow as much as possible a dict-like interface (but for references, it is a bit more complex to make it transparent).

The place to look in the docs is Object and Region References. It states that to access an object pointed to by reference ref, you do

 my_object = my_file[ref]

In your problem, there are two steps:

1. Get the reference
2. Get the dataset

# Open the file
hf = h5py.File('path/to/file', 'r')
# Obtain the dataset of references
n1 = hf['dataset_name']
# Obtain the dataset pointed to by the first reference
ds = hf[n1[0]]
# Obtain the data in ds
data = ds[:]

If the dataset containing references is 2D, for instance, you must use

ds = hf[n1[0,0]]

If the dataset is scalar, you must use

data = ds[()]

To obtain all the datasets at once:

all_data = [hf[ref] for ref in n1[:]]

assuming a 1D dataset for n1. For 2D, the idea holds but I don't see a short way to write it.
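For the 2D case, one way to write it (a sketch, not necessarily the shortest) is to read the references into an object array and walk its indices with `np.ndindex`. The file `refs2d.h5` and the `block_i_j` dataset names below are invented so the example can run by itself.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file: a 2x3 dataset of references, each pointing to a small dataset.
path = os.path.join(tempfile.mkdtemp(), "refs2d.h5")
with h5py.File(path, "w") as f:
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    refs = f.create_dataset("dataset_name", shape=(2, 3), dtype=ref_dtype)
    for i, j in np.ndindex(2, 3):
        target = f.create_dataset(f"block_{i}_{j}", data=np.full(4, 10 * i + j))
        refs[i, j] = target.ref

with h5py.File(path, "r") as hf:
    n1 = hf["dataset_name"][()]                  # object array of references
    all_data = np.empty(n1.shape, dtype=object)  # one ndarray per reference
    for idx in np.ndindex(n1.shape):
        all_data[idx] = hf[n1[idx]][()]          # dereference, then read

print(all_data.shape)
print(all_data[1, 2])
```

Keeping the results in an object array preserves the 2D layout of the references even when the referenced datasets have different shapes.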

To give a full idea of how to round-trip data with references, I wrote a short "writer program" and a short "reader program":

import numpy as np
import h5py

# Open file
myfile = h5py.File('myfile.hdf5', 'w')

# Create dataset
ds_0 = myfile.create_dataset('dataset_0', data=np.arange(10))
ds_1 = myfile.create_dataset('dataset_1', data=9-np.arange(10))

# Create a dataset of references
ref_dtype = h5py.special_dtype(ref=h5py.Reference)

ds_refs = myfile.create_dataset('ref_to_dataset', shape=(2,), dtype=ref_dtype)

ds_refs[0] = ds_0.ref
ds_refs[1] = ds_1.ref

myfile.close()

and

import numpy as np
import h5py

# Open file
myfile = h5py.File('myfile.hdf5', 'r')

# Read the references
ref_to_ds_0 = myfile['ref_to_dataset'][0]
ref_to_ds_1 = myfile['ref_to_dataset'][1]

# Read the dataset
ds_0 = myfile[ref_to_ds_0]
ds_1 = myfile[ref_to_ds_1]

# Read the value in the dataset
data_0 = ds_0[:]
data_1 = ds_1[:]

myfile.close()

print(data_0)
print(data_1)

You will notice that you cannot use the standard, convenient NumPy-like syntax to get at the data behind a reference dataset. This is because HDF5 references are not representable with the native NumPy datatypes (h5py hands them back as Python objects), so each reference must be dereferenced individually.

score 3 · Answer 5 · answered Oct 13 '17 at 15:30

Hi, this is how I read HDF5 data; I hope it is useful to you.

with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]

Yannick Guéhenneux

score 1 · Answer 6 · answered Oct 30 '19 at 09:13

I tried all the answers suggested previously, but none of them worked for me. For example, the read_direct() method gives the error 'Operation not defined for data type class'. The .value attribute also does not work. After a lot of struggling, I got around it by using the reference itself to get the NumPy array.

import numpy as np
import h5py
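f = h5py.File('file.mat','r')
data2get = f.get('data2get')[:]  # array of HDF5 object references

data = np.zeros([data2get.shape[1]])
for i in range(data2get.shape[1]):
    # dereference each entry of the first row and keep element [0][0] of the referenced dataset
    data[i] = np.array(f[data2get[0][i]])[0][0]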
