I'm trying to set up a tf.data.Dataset to stream larger-than-memory CSV files for training. I wrote the following benchmark to estimate processing throughput in MB/s on a ~100 MB CSV.
I'm parsing in batches of 32. Ideally, I'd like a solution that lets me adjust the minibatch size for training without affecting dataset throughput.
Here's my benchmark code:
import os
import time

import numpy as np
import pandas as pd
import tensorflow as tf

input_file = 'playstore.csv'
batch = 32
headers = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
           'Type', 'Price', 'Genres', 'Last Updated']

def get_dataset():
    def _parse_csv(text_in):
        # Parse a batch of raw CSV lines; every column is read as a string.
        defaults = [[''] for _ in headers]
        values = tf.decode_csv(text_in, defaults)
        values = [tf.reshape(x, (-1, 1)) for x in values]
        return dict(zip(headers, values))

    dataset = tf.data.TextLineDataset(input_file).skip(1)  # skip the header row
    dataset = dataset.batch(batch)  # batch the raw lines so decode_csv is vectorized
    dataset = dataset.prefetch(1)
    dataset = dataset.map(_parse_csv, num_parallel_calls=8)
    return dataset

def run_benchmark(dataset):
    # Collect the whole dataset into a single element so that one
    # sess.run drains the entire pipeline.
    dataset = dataset.take(1200000 // batch)
    dataset = dataset.batch(10000000)
    next_element = dataset.make_one_shot_iterator().get_next()

    # Time it
    with tf.Session() as sess:
        tstart = time.time()
        sess.run(next_element['App'][0][0][0])
        t = time.time() - tstart

    mb = os.stat(input_file).st_size / 1024 / 1024
    rate = mb / t
    print('time: {} seconds, speed: {} MB/s'.format(t, rate))

run_benchmark(get_dataset())
This yields ~20 MB/s on my MacBook. I can increase the batch size to get better throughput, but if I chain dataset.unbatch().batch(x) onto the pipeline to control the final batch size after parsing (sketch below), overall speed drops by ~50%.
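For reference, the rebatching variant I mean looks roughly like this; get_training_dataset, train_batch, and the 256 are illustrative, not part of the benchmark above:

def get_training_dataset(train_batch):
    # Reuse the parse-in-batches-of-32 pipeline, then regroup the rows
    # into whatever minibatch size training wants.
    dataset = get_dataset()
    dataset = dataset.unbatch()  # back to individual rows
    dataset = dataset.batch(train_batch)
    return dataset

run_benchmark(get_training_dataset(256))  # e.g. train on minibatches of 256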
Can someone tell me:
- Is this a reasonable way of estimating dataset throughput?
- How can I improve performance?
Update
For comparison, baseline timings on the same ~100 MB file, measured with simple timing loops like the sketch below:
- pandas.read_csv() takes 1.59 seconds (61 MB/s)
- Python csv.reader() takes 1.43 seconds (67 MB/s)
- open / looping through f.readlines() takes 0.37 seconds (262.16 MB/s)
- shutil.copy() takes 0.21 seconds (459.72 MB/s)
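A minimal sketch of the kind of harness behind those numbers (not the exact code I ran; the copy destination is arbitrary):

import csv
import shutil
import time

import pandas as pd

input_file = 'playstore.csv'

def timed(label, fn):
    # Run fn once and report wall-clock time.
    tstart = time.time()
    fn()
    print('{}: {:.2f} seconds'.format(label, time.time() - tstart))

timed('pandas.read_csv', lambda: pd.read_csv(input_file))

def csv_reader_pass():
    with open(input_file) as f:
        for _ in csv.reader(f):
            pass

timed('csv.reader', csv_reader_pass)

def readlines_pass():
    with open(input_file) as f:
        for _ in f.readlines():
            pass

timed('f.readlines', readlines_pass)

timed('shutil.copy', lambda: shutil.copy(input_file, 'playstore_copy.csv'))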