
I'm a bit confused here:

As far as I understand, h5py's .value method reads an entire dataset and dumps it into an array, which is slow and discouraged (it should generally be replaced by [()]). The correct way is to use numpy-esque slicing.

However, I'm getting puzzling results (with h5py 2.2.1):

>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5", 'w')
# Just fill the test file with a numpy array as a test dataset
>>> file["test"] = np.arange(0, 300000)

# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]

I guess that my dataset is so small that .value doesn't hinder performance significantly, but how can the first option be that slow? What is the preferred version here?

Thanks!

UPDATE It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory, while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing in the file is slower than copying the whole array and then slicing in memory. I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest approach.
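For reference, here is a minimal sketch of how the three variants can be timed with timeit (reusing the file created above; absolute numbers will vary with disk, machine, and h5py version):

import timeit

setup = '''
import h5py
import numpy as np
f = h5py.File("test.hdf5", 'r')
'''

# Plain slice: one HDF5 hyperslab read (fast)
print(timeit.timeit('f["test"][0:300000]', setup=setup, number=3))
# Fancy indexing via an explicit index list: handled in Python (very slow; be patient)
print(timeit.timeit('f["test"][list(range(300000))]', setup=setup, number=3))
# Read everything with [()] (the modern spelling of .value), then index in memory (fast)
print(timeit.timeit('f["test"][()][np.arange(300000)]', setup=setup, number=3))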

JiaYow

3 Answers


For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

file['test'][0:300000]

or, for example, reading every other element:

file['test'][0:300000:2]

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the lists are > 1000 elements. Likewise for file['test'][np.arange(300000)], which is interpreted in the same way.
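If you really do need an arbitrary set of indices, a common workaround (a sketch, assuming the indices span a manageable range; the variable names are made up) is to do one cheap hyperslab read covering that range, then fancy-index the resulting numpy array in memory:

import h5py
import numpy as np

f = h5py.File("test.hdf5", 'r')
indices = np.array([5, 17, 42, 123456])  # arbitrary example indices, ascending

# One fast contiguous read covering the needed range...
lo = indices.min()
block = f["test"][lo:indices.max() + 1]
# ...then numpy's fancy indexing in memory, which is cheap
subset = block[indices - lo]

This trades some extra I/O and memory for avoiding h5py's per-index Python path.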

See also:

[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

[2] https://github.com/h5py/h5py/issues/293

Andrew Collette
  • The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing" – JiaYow Feb 15 '14 at 10:36
  • thank you for that 'fancy-indexing' link, I wanted to grab individual elements from the array and that link showed me how to do that =) – Azizbro Feb 18 '21 at 03:25

The .value method copies the data into memory as a numpy array. Try comparing type(file["test"]) with type(file["test"].value): the former should be an HDF5 dataset, the latter a numpy array.
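For example (an interactive session reusing the question's file; the exact repr strings may differ between Python and h5py versions):

>>> import h5py
>>> file = h5py.File("test.hdf5", 'r')
>>> type(file["test"])
<class 'h5py._hl.dataset.Dataset'>
>>> type(file["test"].value)
<class 'numpy.ndarray'>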

I'm not familiar enough with h5py or the HDF5 internals to tell you exactly why certain dataset operations are slow, but the reason those two differ is that in one case you're slicing a numpy array in memory, and in the other you're slicing an HDF5 dataset on disk.

Channing Moore
  • The performance of slicing in memory vs. slicing in the file depends on a lot of things, including the speed of your disk and file-system overhead. It's possible that 300,000 separate read transactions incur more overhead than just reading the whole array in, much the same way that using tar to copy an archive of 300,000 tiny files speeds things up. – Channing Moore Feb 14 '14 at 16:22
  • I've fiddled around a little, and I get faster performance if I read *one* row from the file than if I load the entire array. That is, `file["test"][100]` is faster than `file["test"].value`. It looks like h5py isn't implemented to convert indexing this way into slicing, even in your case where it's equivalent to `slice(None)`. Now that I think of it, I had to manually convert array indices to `slice` objects once to speed up an HDF5 read (see the sketch after this thread). – Channing Moore Feb 14 '14 at 16:44
  • Was the slice approach significantly faster than array indices? – JiaYow Feb 14 '14 at 22:06
  • Yes, provided that it's a contiguous slice like `slice(140, 300)`. – Channing Moore Feb 14 '14 at 22:52
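A minimal sketch of that index-to-slice conversion idea (the helper name indices_to_slice is hypothetical, not an h5py API):

import numpy as np

def indices_to_slice(idx):
    # Convert a contiguous, ascending index array into a slice;
    # otherwise return the indices unchanged (falling back to slow fancy indexing).
    idx = np.asarray(idx)
    if idx.size > 0 and np.all(np.diff(idx) == 1):
        return slice(int(idx[0]), int(idx[-1]) + 1)
    return idx

# file["test"][indices_to_slice(np.arange(140, 300))] then does a single hyperslab read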

Based on the title of your post, the 'correct' way to slice array datasets is to use the builtin slice notation.

All of your examples are equivalent to file["test"][:]

[:] selects all elements in the array

More information about slicing notation can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

I use hdf5 + python often, and I've never had to use the .value method. When you access a dataset, as in myarr = file["test"], you get a handle to the on-disk dataset; the actual data is copied into a numpy array only when you slice it, for example myarr[:].
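A quick illustration of that distinction (a sketch, reusing the question's file):

import h5py

f = h5py.File("test.hdf5", 'r')
dset = f["test"]        # just a handle to the on-disk dataset; nothing is read yet
arr = dset[:]           # [:] reads the full dataset into a numpy array
part = dset[1000:2000]  # or read only the part you actually need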

abnowack