I'm working on a dataset that is too big to fit into RAM. The solution I'm currently trying is to use numpy memmap to load one sample/row at a time with a PyTorch DataLoader. The solution looks something like this:

import numpy as np
import torch
from torch.utils.data import DataLoader


class MMDataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset_len = 44000000
        self.bytes_per_value = 32 // 8  # float32 is 4 bytes
        self.num_cols = 512
        self.num_rows = 1  # one row per __getitem__ call

    def __getitem__(self, index):
        # Map only the requested row, then copy it out of the mapping.
        x = np.memmap(self.file_path, dtype='float32', mode='r',
                      shape=(self.num_rows, self.num_cols),
                      offset=index * self.num_cols * self.bytes_per_value)
        return np.array(x)

    def __len__(self):
        return self.dataset_len



dataset = MMDataset('./data/emb.memmap')

data_loader = DataLoader(
    dataset,
    batch_size=4096,
    shuffle=True,
    num_workers=20
)

When the available RAM is greater than the size of the memmap file, data loading is fast: I get around 60 batches/second. When the available RAM is less than the size of the memmap file, I only get around 3 batches/second.
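
For reference, the throughput numbers above come from timing the loader directly. A minimal sketch (batches_per_second is just an illustrative helper; the warm-up skip and batch count are arbitrary choices):

import time

def batches_per_second(loader, num_batches=100):
    it = iter(loader)
    next(it)  # discard the first batch so worker startup isn't counted
    start = time.time()
    for _ in range(num_batches):
        next(it)
    return num_batches / (time.time() - start)

print(batches_per_second(data_loader))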

I discovered this when trying various sizes for the memmap file.
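
To vary the file size I regenerate the file with the same row layout (rows of 512 float32 values). A rough sketch, where make_test_file and the chosen row counts are just illustrative:

import numpy as np

def make_test_file(path, num_rows, num_cols=512, chunk=100_000):
    # Write num_rows rows of float32 values, same layout as emb.memmap,
    # filling in chunks so the random data never has to fit in RAM at once.
    mm = np.memmap(path, dtype='float32', mode='w+', shape=(num_rows, num_cols))
    for start in range(0, num_rows, chunk):
        stop = min(start + chunk, num_rows)
        mm[start:stop] = np.random.rand(stop - start, num_cols).astype('float32')
    mm.flush()

make_test_file('./data/emb_small.memmap', 1_000_000)   # ~2 GB
make_test_file('./data/emb_large.memmap', 16_000_000)  # ~33 GB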

Why is this the case? If DataLoader + memmap is going to throttle whenever available RAM < memmap file size, that defeats the point of the approach.

I've observed that disk read I/O holds at a constant ~500 MB/s whenever available RAM < memmap file size. This is much higher than the theoretical amount of reading required to load a batch of 4096 samples (roughly 8 MB per batch).
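
The back-of-the-envelope arithmetic behind that 8 MB figure, using the shapes from the dataset above (the amplification estimate assumes the ~3 batches/s rate measured earlier):

batch_size = 4096
bytes_per_sample = 512 * 4                   # 512 float32 values = 2 KiB per row
batch_bytes = batch_size * bytes_per_sample
print(batch_bytes / 2**20)                   # ~8 MiB of useful data per batch
# At ~3 batches/s that is only ~24 MiB/s of useful reads, yet the disk is
# reading ~500 MB/s, i.e. roughly 20x more than the data actually needed.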

  • memmap might have buffered the entire file into memory if your RAM is large enough. What is the memory usage when the memmap file is smaller than the amount of RAM? Is it close to the size of the file itself? – hkchengrex May 29 '20 at 05:19
  • I think you might be right about buffering, but the memory usage of memmap doesn't show up with the `free` command; `watch free -g` only shows 2 GB being used. – Kevin May 29 '20 at 13:40

0 Answers