
I have received the following error:

TypeError: Can't broadcast (3, 2048, 1, 1) -> (4, 2048, 1, 1)

I am extracting features and placing them into a hdf5 dataset like this:

array_40 = hdf5_file.create_dataset(
                    f'{phase}_40x_arrays',  shape, maxshape=(None, args.batch_size, 2048, 1, 1))

In (None, args.batch_size, 2048, 1, 1), None is specified because the total size of the dataset is unknown in advance. args.batch_size is 4 in this case, and 2048, 1 and 1 are the number of extracted features and their spatial dimensions.

shape is defined as:

shape = (dataset_length, args.batch_size, 2048, 1, 1)

However, I'm not sure what to do with the args.batch_size dimension, which in this case is 4. I can't leave it as None, as that raises an error:

ValueError: Illegal value in chunk tuple

EDIT: Yes, you're absolutely right. I'm trying to incrementally write to an hdf5 dataset. I've shown more of the code below. I'm extracting features and storing them incrementally into an hdf5 dataset. Despite a batch size of 4, it would be ideal to save each item from the batch incrementally as its own instance/row.

            shape = (dataset_length, 2048, 1, 1)
            all_shape = (dataset_length, 6144, 1, 1)
            labels_shape = (dataset_length)
            batch_shape = (1,)

            path = args.HDF5_dataset + f'{phase}.hdf5'

            #hdf5_file = h5py.File(path, mode='w')
            with h5py.File(path, mode='a') as hdf5_file:

                array_40 = hdf5_file.create_dataset(
                    f'{phase}_40x_arrays',  shape, maxshape=(None, 2048, 1, 1)
                )
                array_labels = hdf5_file.create_dataset(
                    f'{phase}_labels', labels_shape, maxshape=(None,), dtype=string_type
                )
                array_batch_idx = hdf5_file.create_dataset(
                    f'{phase}_batch_idx', data=np.array([-1, ])
                )

                hdf5_file.close()

        # either a new or a checkpointed file exists
        # load the file and create references to the existing h5 datasets
        with h5py.File(path, mode='r+') as hdf5_file:
            array_40 = hdf5_file[f'{phase}_40x_arrays']
            array_labels = hdf5_file[f'{phase}_labels']
            array_batch_idx = hdf5_file[f'{phase}_batch_idx']

            batch_idx = int(array_batch_idx[0]+1)

            print("Batch ID is restarting from {}".format(batch_idx))

            dataloaders_dict = torch.utils.data.DataLoader(
                datasets_dict, batch_size=args.batch_size,
                sampler=SequentialSampler2(datasets_dict, batch_idx, args.batch_size),
                drop_last=True, num_workers=args.num_workers, shuffle=False
            )  # shuffling must stay False for the sampler to work and in case you restart


            for i, (inputs40x, paths40x, labels) in enumerate(dataloaders_dict):

                print(f'Batch ID: {batch_idx}')

                inputs40x = inputs40x.to(device)
                labels = labels.to(device)
                paths = paths40x

                x40 = resnet(inputs40x)

                # torch.Size([1, 2048, 1, 1]) batch, feats, 1l, 1l
                array_40[...] = x40.cpu()
                array_labels[batch_idx, ...] = labels[:].cpu()
                array_batch_idx[:,...] = batch_idx

                batch_idx +=1
                hdf5_file.flush()

  • The error strongly suggests that `args.batch_size` is not the same in the two different places you're using it (it's 3 somewhere). – Blckknght May 03 '20 at 20:00
  • Thanks for the reply. I understand that, I should rephrase my question. How can I handle variable sizes in that dimension? For example, I have 51 instances/ rows in a dataset. With a batch size of 4, I can fill in my hdf5 dataset 12 times, however, the last batch, which will contain 3, will produce an error. I want to be able to handle the variable input size in the args.batch_size dimension. If I leave that as None, I get the following error: ValueError: Illegal value in chunk tuple. I'm not sure what I can do... – TSRAI May 03 '20 at 20:11
  • @Taran, I'm not a ML/AI guy, so I don't use `pytorch DataLoader`. As I understand, it returns an iterable to access the data. Your code iterates on it with `enumerate()`. As you get each batch, you will have to map that data in `inputs40x, paths40x, labels` to the next open rows in the matching HDF5 datasets. You can't use [...] You need the indices for the batch rows. Use a position counter to do this. – kcw78 May 05 '20 at 19:35
  • Hi kcw78, thanks for the reply, you've been really helpful. The dataloader has a customised sequential sampler that allows the dataloader to maintain order :) In regards to the issue, I dropped the last batch. I also actually fixed the code to append each batch item by essentially applying: ` array_40[batch_idx*args.batch_size:(batch_idx+1)*args.batch_size, ...] = x40.cpu() ` – TSRAI May 05 '20 at 21:37

1 Answer


I think you are confused about the use of the maxshape=() parameter. It sets the maximum allocated dataset size in each dimension. The first dataset dimension is set to dataset_length at creation, with maxshape[0]=None allowing unlimited growth in that dimension. The second dataset dimension is created with size args.batch_size, and you specified the same size in maxshape, so you can't increase this dimension.
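
To illustrate the difference, here is a minimal, self-contained sketch (the file and dataset names are just placeholders): a dimension whose maxshape entry is None can be grown later with resize(), while the other dimensions stay fixed.

import h5py
import numpy as np

with h5py.File('maxshape_demo.hdf5', 'w') as f:
    # axis 0 can grow because its maxshape entry is None; the other axes are fixed
    dset = f.create_dataset('feats', shape=(0, 2048, 1, 1),
                            maxshape=(None, 2048, 1, 1), dtype='float32')

    batch = np.zeros((4, 2048, 1, 1), dtype='float32')   # one batch of features
    dset.resize(dset.shape[0] + batch.shape[0], axis=0)  # grow along axis 0
    dset[-batch.shape[0]:] = batch                        # write the new rows
    # resizing axis 1 beyond 2048 would fail, since maxshape[1] is fixed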

I'm a little confused by your example. It sounds like you are trying to incrementally write data to the dataset in rows/instances of args.batch_size. Your example has 51 rows/instances of data, and you want to write in batches of args.batch_size=4. With 51 rows, you can write the first 48 rows (0-3, 4-7, ..., 44-47), and are then stuck with the 3 remaining rows. Can't you address this by adding a counter (call it nrows_left) and changing the batch size argument to min(args.batch_size, nrows_left)? That seems like the easiest solution to me.

Without more info, I can't write a complete example.
I will attempt to show what I mean below:

# args.batch_size = 4
shape = (dataset_length, 2048, 1, 1)
array_40 = hdf5_file.create_dataset(
           f'{phase}_40x_arrays', shape, maxshape=(None, 2048, 1, 1))
nrows_left = dataset_length          # rows still to be written
rcnt = 0                             # next open row in the dataset
loopcnt = dataset_length // args.batch_size
if dataset_length % args.batch_size != 0:
    loopcnt += 1
for loop in range(loopcnt):
    nload = min(nrows_left, args.batch_size)   # last pass may load fewer rows
    array_40[rcnt:rcnt + nload] = img_data[rcnt:rcnt + nload]
    rcnt += nload
    nrows_left -= nload
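
To connect this back to the DataLoader loop in the question, here is a rough sketch of how the counter approach could be applied there (variable names follow the question's code; slicing by the actual tensor length instead of args.batch_size is an assumption on my part, so the final batch of 3 also fits without needing drop_last=True):

# next open row in the HDF5 dataset (batch_idx is restored from the file)
row = batch_idx * args.batch_size

for i, (inputs40x, paths40x, labels) in enumerate(dataloaders_dict):
    x40 = resnet(inputs40x.to(device))   # e.g. torch.Size([4, 2048, 1, 1])
    n = x40.shape[0]                     # actual batch length (3 on the last pass)

    array_40[row:row + n, ...] = x40.detach().cpu().numpy()  # one row per batch item
    array_batch_idx[0] = batch_idx                            # remember progress for restarts

    row += n
    batch_idx += 1
    hdf5_file.flush()

array_labels can be filled with the same row:row + n slice.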
  • EDIT: Yes, you're absolutely right. I'm trying to incrementally write to an hdf5 dataset. I've shown more of the code below. I'm extracting features and storing them incrementally into an hdf5 dataset. Despite a batch size of 4, it would be ideal to save each item from the batch incrementally as its own instance/row. I've updated the code above to reflect this task. – TSRAI May 04 '20 at 22:30