
I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the hdf5 file into a list or another data structure, it either takes so long that I abort the execution or it crashes my computer. I need to be able to read 3 sets of hdf5 files, use their data, manipulate it, use it to train a CNN model, and make predictions.

Any help for reading and using these large HDF5 files would be greatly appreciated.

Currently this is how I am reading the hdf5 file:

import os
import h5py

db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r')
training_db = list(db['data'])
David

3 Answers


Crashes probably mean you are running out of memory. As Vignesh Pillay suggested, I would try chunking the data and working on a small piece of it at a time. If you are using the pandas method read_hdf, you can use the iterator and chunksize parameters to control the chunking:

import pandas as pd

data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
    # train the CNN on this chunk here
    print(chunk.shape)

Note that this requires the HDF5 file to have been written in table format (format='table' in pandas).
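If your file was not written that way, a minimal sketch of saving a DataFrame in table format with pandas (the path, key, and DataFrame below are just placeholders matching the example above) would be:

import numpy as np
import pandas as pd

# made-up data purely to illustrate writing in table format
df = pd.DataFrame(np.random.rand(1000, 4), columns=['a', 'b', 'c', 'd'])
df.to_hdf('/tmp/test.hdf', key='test_key', format='table')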

Tomer

My answer was updated 2020-08-03 to reflect the code you added to your question. As @Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) into a list will create about 3.1 billion entities. If you read 3 image sets, it will require even more RAM. I assume this is image data (maybe 20670 images of shape (224, 224, 3))? If so, you can read the data in slices with both h5py and tables (PyTables). This will return the data as a NumPy array, which you can use directly (there is no need to manipulate it into a different data structure).

The basic process would look like this:

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    # loop to get images 1 by 1
    for icnt in range(20670):
        image_arr = training_db[icnt, :, :, :]
        # then do something with the image

You could also read multiple images at a time by setting the first index to a range (say icnt:icnt+100) and then handling the looping appropriately, as sketched below.
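For example, a rough sketch of reading 100 images at a time (the batch size of 100 is only an illustration) could look like this:

batch_size = 100
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    n_images = training_db.shape[0]  # 20670 in your case
    for start in range(0, n_images, batch_size):
        # slicing the h5py dataset only reads this batch into memory
        batch_arr = training_db[start:start + batch_size, :, :, :]
        # then do something with the batch of images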

kcw78
  • I added how I am reading and using the hdf5 file. My computer has 8GB of RAM and everything you assumed was correct. It's 20670 images with shapes (224,224,3). Would I be able to train the CNN model in batches? – David Aug 03 '20 at 03:00
  • I updated my answer to reflect your code. Note: `training_db` is an h5py dataset object that "behaves like" a NumPy array. However, it requires a lot less memory than reading the dataset contents into memory (as a list or array). I am not familiar with CNNs, so I don't know how to train in batches. I've seen other posts like this, so I assume it can be done. Frankly, 8GB of RAM isn't very much when you want to work with large data sets. – kcw78 Aug 03 '20 at 12:56
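If it helps, here is a very rough sketch of training in batches with a plain Python generator that reads slices from the h5py dataset and feeds them to model.fit. This assumes TensorFlow/Keras and a hypothetical 'labels' dataset stored alongside 'data'; adapt it to however your labels are actually stored:

import os
import h5py

def hdf5_batch_generator(path, batch_size=32):
    # yields (images, labels) batches read directly from the HDF5 file
    with h5py.File(path, 'r') as db:
        data = db['data']
        labels = db['labels']  # hypothetical dataset holding the labels
        n = data.shape[0]
        while True:
            for start in range(0, n, batch_size):
                x = data[start:start + batch_size].astype('float32') / 255.0  # scale pixels
                y = labels[start:start + batch_size]
                yield x, y

# model = tf.keras.Sequential([...])  # your CNN goes here
# model.fit(hdf5_batch_generator(os.getcwd() + "/Results/Training_Dataset.hdf5"),
#           steps_per_epoch=20670 // 32, epochs=10)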

Your problem arises because you are running out of memory, so Virtual Datasets come in handy when dealing with large datasets like yours. Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html

I would recommend starting with one file at a time. First, create a virtual dataset file of your existing data, like this:

import os
import h5py
import numpy as np

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape=data_shape, dtype=np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    layout[...] = vsource  # map the source dataset into the virtual layout
    with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver='latest') as file:
        file.create_virtual_dataset('data', layout, fillvalue=0)

This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open the file in r+ mode, like this:

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    # do whatever manipulation you want to do here

One more piece of advice: make sure the indices you use when slicing are of int datatype, otherwise you will get an error.
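For example, a split point computed as a float (the 80/20 split below is only an illustration) has to be cast before it can be used for slicing:

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r') as file:
    data = file['data']
    split = data.shape[0] * 0.8      # this is a float, e.g. 16536.0
    train_data = data[:int(split)]   # cast to int, otherwise slicing raises an error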

Abdul