
I have some large data files (32 x VERY BIG) that I would like to concatenate. However, the data were collected in the wrong order, so I need to reorder the rows as well.

So far, what I am doing is:

# Assume FILE_1 and FILE_2 are paths to the appropriate files.
# FILE_1 is a matrix of size 32 x SIZE_1
# FILE_2 is a matrix of size 32 x SIZE_2
data_1 = np.memmap(FILE_1, mode='r', dtype='<i2', order='F', shape=(32, SIZE_1))
data_2 = np.memmap(FILE_2, mode='r', dtype='<i2', order='F', shape=(32, SIZE_2))

data_out = np.memmap('output', mode='w+', dtype='<i2', order='F', shape=(32, SIZE_1 + SIZE_2))

channel_mapping = [15, 14, 13, 12, 11, 10, 9, 8, 0, 1, 2, 3, 4, 5, 6, 7,
                   24, 25, 26, 27, 28, 29, 30, 31, 23, 22, 21, 20, 19, 18, 17, 16]

data_out[:, :SIZE_1] = data_1[channel_mapping, :]
data_out[:, SIZE_1:] = data_2[channel_mapping, :]

I actually do this in a for loop with more than 2 files, but you get the idea.
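For concreteness, here is a minimal runnable sketch of that loop. The file names, sizes, and demo input data below are placeholders (not the real files); the demo inputs are written in Fortran order so the `order='F'` memmaps read them back correctly:

```python
import numpy as np

channel_mapping = [15, 14, 13, 12, 11, 10, 9, 8, 0, 1, 2, 3, 4, 5, 6, 7,
                   24, 25, 26, 27, 28, 29, 30, 31, 23, 22, 21, 20, 19, 18, 17, 16]

# Hypothetical file list and per-file column counts, standing in for the real data.
FILES = ['file_1.dat', 'file_2.dat']
SIZES = [5, 7]

# Create small demo inputs so the sketch is self-contained.
# Writing the C-contiguous transpose produces Fortran-order bytes on disk.
for path, size in zip(FILES, SIZES):
    demo = np.arange(32 * size, dtype='<i2').reshape(32, size)
    np.ascontiguousarray(demo.T).tofile(path)

data_out = np.memmap('output', mode='w+', dtype='<i2', order='F',
                     shape=(32, sum(SIZES)))

offset = 0
for path, size in zip(FILES, SIZES):
    data_in = np.memmap(path, mode='r', dtype='<i2', order='F', shape=(32, size))
    # Reorder the 32 channels (rows) while copying this file's columns.
    data_out[:, offset:offset + size] = data_in[channel_mapping, :]
    offset += size

data_out.flush()
```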

Is this the most efficient way to do it? I am afraid that applying channel_mapping via fancy indexing will materialize the whole reordered file in memory and slow the process down. As it is, this is much slower than simply concatenating the files.

  • If the order of the file has to be manually set then you're not going to find a very efficient way of reordering the entire file. – Bob Smith Mar 26 '20 at 21:55
  • You may be best to load the input files using memmap, because you need the random access; then don't construct the corrected sequence into `data_out` in memory, but write the sequence directly to the output file. – barny Mar 26 '20 at 22:47
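Following the suggestion in the comments, one way to keep the fancy-indexed temporary small is to copy in column chunks, so at most a 32 x `chunk` block is ever materialized at once. A sketch of that idea (`copy_reordered` and the chunk size are hypothetical names, not part of the question's code):

```python
import numpy as np

CHUNK = 1_000_000  # columns per chunk; tune to available RAM (assumed value)

def copy_reordered(data_in, data_out, col_offset, channel_mapping, chunk=CHUNK):
    """Copy data_in into data_out starting at column col_offset,
    reordering the rows, one column-chunk at a time."""
    n_cols = data_in.shape[1]
    for start in range(0, n_cols, chunk):
        stop = min(start + chunk, n_cols)
        # Only a (32, chunk) temporary is materialized per iteration.
        data_out[:, col_offset + start:col_offset + stop] = \
            data_in[channel_mapping, start:stop]
```

The same function works whether `data_in`/`data_out` are memmaps or plain arrays, so the multi-file loop just calls it once per file with a running column offset.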
