
I'm trying to set up a tf.data.Dataset to stream larger-than-memory CSV files for training. I wrote the following benchmark to estimate processing throughput in MB/sec using a ~100 MB CSV.

I'm processing in batches of 32. Ideally, I'd like a solution that lets me adjust the minibatch size for training without affecting dataset performance.

Here's my benchmark code:

import os
import tensorflow as tf
import pandas as pd
import numpy as np
import time

input_file = 'playstore.csv'
batch = 32
headers = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Genres', 'Last Updated']


def get_dataset():

    def _parse_csv(text_in):
        # Decode a batch of raw CSV lines; every column is read as a string.
        defaults = [[''] for _ in headers]
        values = tf.decode_csv(text_in, defaults)
        values = [tf.reshape(x, (-1, 1)) for x in values]
        return dict(zip(headers, values))

    dataset = tf.data.TextLineDataset(input_file).skip(1)  # skip the header row
    dataset = dataset.batch(batch)                          # batch raw lines before parsing
    dataset = dataset.prefetch(1)
    dataset = dataset.map(_parse_csv, num_parallel_calls=8)
    return dataset


def run_benchmark(dataset):    
    # Get the whole dataset in one tensor
    dataset = dataset.take(1200000//batch)
    dataset = dataset.batch(10000000)
    next_element = dataset.make_one_shot_iterator().get_next()

    # Time it
    with tf.Session() as sess:
        tstart = time.time()
        # Only one value is fetched, but producing it forces the whole pipeline to run.
        sess.run(next_element['App'][0][0][0])
        t = time.time() - tstart

    mb = os.stat(input_file).st_size/1024/1024
    rate = mb / t
    print('time: {} seconds, speed: {} mb/sec'.format(t, rate))


run_benchmark(get_dataset())

This yields ~20 MB/sec on my MacBook. I can increase the batch size to get better throughput, but if I chain it with dataset.unbatch().batch(x) to control the final batch size after parsing, overall speed drops by ~50%.
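For reference, the re-batched variant I'm describing looks roughly like this (a sketch only; parse_batch and train_batch are names I'm using here, and on older TF releases unbatch() may need to be written as dataset.apply(tf.data.experimental.unbatch())):

def get_rebatched_dataset(parse_batch=1024, train_batch=32):
    # Same parsing as in get_dataset(), but with the parse batch size
    # decoupled from the training minibatch size.
    def _parse_csv(text_in):
        defaults = [[''] for _ in headers]
        values = tf.decode_csv(text_in, defaults)
        values = [tf.reshape(x, (-1, 1)) for x in values]
        return dict(zip(headers, values))

    dataset = tf.data.TextLineDataset(input_file).skip(1)
    dataset = dataset.batch(parse_batch)                    # big batch just for parsing
    dataset = dataset.map(_parse_csv, num_parallel_calls=8)
    dataset = dataset.unbatch()                             # back to single records
    dataset = dataset.batch(train_batch)                    # final minibatch size for training
    dataset = dataset.prefetch(1)
    return dataset

Benchmarking this variant (e.g. run_benchmark(get_rebatched_dataset())) is where I see the ~50% drop.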

Can someone tell me:

  • Is this a reasonable way of estimating dataset throughput?
  • How can I improve performance?

Update

For comparison, here's how fast the same file reads outside of TensorFlow (a rough sketch of how these were timed follows the list):

  • pandas.read_csv() takes 1.59 seconds (61 MB/sec)
  • python csv.reader() takes 1.43 seconds (67 MB/sec)
  • open / loop through f.readlines() takes 0.37 seconds (262.16 MB/sec)
  • shutil.copy() takes 0.21 seconds (459.72 MB/sec)
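These baseline numbers came from straightforward timing loops along these lines (a minimal sketch rather than the exact script; the copy destination path is just an example):

import csv
import os
import shutil
import time

import pandas as pd

input_file = 'playstore.csv'
size_mb = os.stat(input_file).st_size / 1024 / 1024

def report(name, seconds):
    print('{}: {:.2f} seconds, {:.2f} MB/sec'.format(name, seconds, size_mb / seconds))

# pandas.read_csv()
t0 = time.time()
pd.read_csv(input_file)
report('pandas.read_csv', time.time() - t0)

# python csv.reader()
t0 = time.time()
with open(input_file) as f:
    for _ in csv.reader(f):
        pass
report('csv.reader', time.time() - t0)

# open / loop through f.readlines()
t0 = time.time()
with open(input_file) as f:
    for _ in f.readlines():
        pass
report('readlines', time.time() - t0)

# shutil.copy()
t0 = time.time()
shutil.copy(input_file, 'playstore_copy.csv')
report('shutil.copy', time.time() - t0)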
  • Did you also compare to other "read_csv" functions, such as the one in Python's standard library or the one from pandas? Or even reading the file manually, if it's in the simplest CSV format? It may be interesting... – B. Go Mar 05 '19 at 22:44
  • Why do you prefetch only one element (`prefetch(1)`)? Have you tried increasing the prefetch's buffer_size? – MPękalski Mar 05 '19 at 23:31
  • Added pandas, csv, plain open/loop and copy() times to my answer at the bottom. – Kevin Mar 05 '19 at 23:34
  • Increasing prefetch to 10 brings the speed to 24 MB/sec; increasing it to 100 brings it to 22 MB/sec. – Kevin Mar 05 '19 at 23:36
