I have two numpy arrays stored in HDF5 files that are 44 GB each. I need to concatenate them, but I have to do it on disk because I only have 8 GB of RAM. How would I do this?
Thank you!
The related post is about keeping the datasets distinct in the resulting file. Concatenating them in Python is possible, but you will need to read and write the datasets in multiple operations: for example, read 1 GB from file 1, write it to the output file, repeat until all the data from file 1 has been copied, then do the same for file 2. You need to declare the dataset in the output file with its final size up front:
d = f.create_dataset('name_of_dataset', shape=shape, dtype=dtype, data=None)
where shape is the combined shape of the two input datasets and dtype matches theirs.
To write to d:
d[i*N:(i+1)*N] = d_from_file_1[i*N:(i+1)*N]
This should load only part of each dataset into memory at a time.
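The steps above can be put together roughly as follows. This is a minimal sketch, not a tested recipe for 44 GB files: the function name `concat_on_disk`, its parameters, and the default block size are hypothetical, and it assumes both files hold a dataset with the same name, the same dtype, and the same trailing dimensions, concatenated along the first axis.

```python
import h5py


def concat_on_disk(path_a, path_b, out_path, name, rows_per_step=100_000):
    """Concatenate dataset `name` from two files along axis 0, block by block."""
    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb, \
            h5py.File(out_path, "w") as fout:
        sources = [fa[name], fb[name]]
        # Assumption: both inputs share dtype and trailing dimensions.
        if sources[0].shape[1:] != sources[1].shape[1:]:
            raise ValueError("trailing dimensions differ")

        # Declare the output dataset with its final size up front.
        total_rows = sum(src.shape[0] for src in sources)
        out = fout.create_dataset(
            name,
            shape=(total_rows,) + sources[0].shape[1:],
            dtype=sources[0].dtype,
        )

        offset = 0  # first output row for the current source
        for src in sources:
            for start in range(0, src.shape[0], rows_per_step):
                stop = min(start + rows_per_step, src.shape[0])
                # Only rows start..stop of the source are read into memory here.
                out[offset + start:offset + stop] = src[start:stop]
            offset += src.shape[0]
```

Choose `rows_per_step` so that one block (rows times row size in bytes) fits comfortably in the available RAM.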
The file that you want to extend must contain an extendable dataset with at least one unlimited dimension and a reasonable chunk size. You can then append data to this dataset easily; the HDF5 file format is well suited to this task. If appending does not work, you probably just need to create a new file, which should not be a problem. The following example creates two files and then merges the data from the second file into the first one. Tested with files > 80 GB; memory use is not a problem.
import h5py
import numpy as np
ini_dim1 = 100000
ini_dim2 = 1000
counter = int(ini_dim1/10)
dim_extend = int(ini_dim1/counter)
def create_random_dataset(name, dim1, dim2):
    # Create an extendable dataset (chunked, unlimited maxshape) and fill it block by block
    ff1 = h5py.File(name,'w')
    ff1.create_dataset('test_var',(dim1,dim2),maxshape=(None,None),chunks=(10,10))
    for i in range(dim1 // dim_extend):
        ff1['test_var'][i*dim_extend:(i+1)*dim_extend,:] = np.random.random((dim_extend,dim2))
        ff1.flush()
    ff1.close()
create_random_dataset('test1.h5', ini_dim1, ini_dim2)
create_random_dataset('test2.h5', ini_dim1, ini_dim2)
## append second to first
ff3 = h5py.File('test2.h5','r')
ff4 = h5py.File('test1.h5','a')
print(ff3['test_var'])
print(ff4['test_var'])
ff4['test_var'].resize((ini_dim1*2,ini_dim2))
print(ff4['test_var'])
for i in range(counter):
    ff4['test_var'][ini_dim1+i*dim_extend:ini_dim1+(i+1)*dim_extend,:] = ff3['test_var'][i*dim_extend:(i+1)*dim_extend,:]
    ff4.flush()
ff3.close()
ff4.close()
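The append step above can also be wrapped in a reusable function. The following is only a sketch under assumptions: `append_dataset` and its parameters are hypothetical names, the two datasets are assumed to share dtype and trailing dimensions, and the destination dataset must already be chunked with a first dimension that is allowed to grow. Unlike the fixed loop above, it also copies a final partial block when the row count is not a multiple of the block size.

```python
import h5py


def append_dataset(dst_path, src_path, name, rows_per_step=10_000):
    """Append the rows of `name` in src_path to the same dataset in dst_path."""
    with h5py.File(src_path, "r") as fsrc, h5py.File(dst_path, "a") as fdst:
        src, dst = fsrc[name], fdst[name]
        old_rows = dst.shape[0]
        new_rows = old_rows + src.shape[0]

        # resize() needs a chunked dataset whose maxshape permits the new size.
        limit = dst.maxshape[0]
        if dst.chunks is None or (limit is not None and limit < new_rows):
            raise ValueError("destination dataset cannot grow along axis 0")

        dst.resize(new_rows, axis=0)
        for start in range(0, src.shape[0], rows_per_step):
            stop = min(start + rows_per_step, src.shape[0])
            # Copy one block; only these rows are held in memory.
            dst[old_rows + start:old_rows + stop] = src[start:stop]
```

With the files from the example, a call would look like `append_dataset('test1.h5', 'test2.h5', 'test_var')`.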