
I have two numpy arrays stored in HDF5 files that are 44 GB each. I need to concatenate them, but I have to do it on disk because I only have 8 GB of RAM. How would I do this?

Thank you!

hpaulj
user798719
  • HDF5 doesn't know anything about numpy, so it's not a "numpy array". Just read a portion of them at a time and concatenate them. `[1, 2, 3, 4]`.concat(`[5, 6, 7, 8]`) is the same as `[1, 2].concat([3, 4]).concat([5, 6]).concat([7, 8])`, so you should be able to do it in parts. – arboreal84 May 12 '17 at 04:44
  • There's an `h5py` module that can load arrays from an hdf5 file. And it can load in chunks. But if you can't load both files, you can't concatenate them or write the new bigger array to the file. – hpaulj May 12 '17 at 05:21
  • Possible duplicate of [Combining hdf5 files](http://stackoverflow.com/questions/18492273/combining-hdf5-files) – kazemakase May 12 '17 at 06:13
  • hpaulj, so if my hdf5 files are 44 gb each and I want to combine them into one hdf5 file, I will need 88gb of ram temporarily to combine before writing back out to hdf5 file? – user798719 May 12 '17 at 07:10
  • Yes, if you want to do that with `python`. I don't know what can be done with the `hdf5` utilities (which are C or Fortran based). – hpaulj May 12 '17 at 07:18

2 Answers


The related post shows how to obtain distinct datasets in the resulting file. In Python it is possible, but you will need to read and write the datasets in multiple operations: say, read 1 GB from file 1, write it to the output file, and repeat until all of file 1 has been copied, then do the same for file 2. You need to declare the dataset in the output file with the appropriate final size up front:

d = f.create_dataset('name_of_dataset', shape=shape, dtype=dtype, data=None)

where `shape` is computed from the two input datasets and `dtype` matches theirs.

To write block `i` of `N` rows to `d`: `d[i*N:(i+1)*N] = d_from_file_1[i*N:(i+1)*N]`

This way, the datasets are only ever partially loaded in memory.
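Putting those pieces together, a minimal sketch of the whole copy loop could look like the following (the file names, the dataset name `data`, and the block size `N` are placeholders, not taken from the question):

import h5py

N = 1000000  # rows copied per iteration; pick it so N rows fit comfortably in RAM

with h5py.File('file1.h5', 'r') as f1, \
     h5py.File('file2.h5', 'r') as f2, \
     h5py.File('out.h5', 'w') as fo:
    d1, d2 = f1['data'], f2['data']
    # output dataset has the final, combined size along the first axis
    shape = (d1.shape[0] + d2.shape[0],) + d1.shape[1:]
    d = fo.create_dataset('data', shape=shape, dtype=d1.dtype)
    offset = 0
    for src in (d1, d2):
        # copy block by block; only N rows are held in memory at a time
        for i in range(0, src.shape[0], N):
            stop = min(i + N, src.shape[0])
            d[offset + i:offset + stop] = src[i:stop]
        offset += src.shape[0]

Note that slicing an `h5py` dataset (`src[i:stop]`) only reads that slice from disk, which is what keeps memory use bounded.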

Pierre de Buyl

The file you want to extend must contain a dataset created with at least one unlimited dimension (`maxshape` set to `None` along that axis) and a reasonable chunk size. Then you can easily append data to that dataset; the HDF5 file format is actually well suited to such a task. If appending does not work because the existing dataset is not resizable, you can simply create a new file instead, which should not be a problem. The following example creates two files and later merges the data from the second file into the first one. Tested with files > 80 GB; memory use is not a problem.

import h5py
import numpy as np

ini_dim1 = 100000
ini_dim2 = 1000

counter = ini_dim1 // 10          # number of blocks to write
dim_extend = ini_dim1 // counter  # rows per block

def create_random_dataset(name, dim1, dim2):
    # maxshape=(None, None) makes both dimensions unlimited, so the
    # dataset can be resized later; chunking is required for that
    ff1 = h5py.File(name, 'w')
    ff1.create_dataset('test_var', (dim1, dim2), maxshape=(None, None), chunks=(10, 10))
    # fill the dataset block by block so only dim_extend rows are in memory at once
    for i in range(counter):
        ff1['test_var'][i*dim_extend:(i+1)*dim_extend, :] = np.random.random((dim_extend, dim2))
        ff1.flush()
    ff1.close()

create_random_dataset('test1.h5', ini_dim1, ini_dim2)
create_random_dataset('test2.h5', ini_dim1, ini_dim2)

## append second to first
ff3 = h5py.File('test2.h5', 'r')
ff4 = h5py.File('test1.h5', 'a')
print(ff3['test_var'])
print(ff4['test_var'])
# grow the first dataset so it can hold both arrays
ff4['test_var'].resize((ini_dim1*2, ini_dim2))
print(ff4['test_var'])

# copy the second dataset into the newly added rows, one block at a time
for i in range(counter):
    ff4['test_var'][ini_dim1 + i*dim_extend:ini_dim1 + (i+1)*dim_extend, :] = \
        ff3['test_var'][i*dim_extend:(i+1)*dim_extend, :]
    ff4.flush()
ff3.close()
ff4.close()
kakk11