
I am trying to optimize my data input pipeline. The dataset is a set of 450 TFRecord files of ~70 MB each, hosted on GCS. The job is executed with GCP ML Engine. There is no GPU.

Here is the pipeline:

import tensorflow as tf


def build_dataset(file_pattern):
    return tf.data.Dataset.list_files(
        file_pattern
    ).interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        buffer_size=2048
    ).batch(
        batch_size=2048,
        drop_remainder=True,
    ).cache(
    ).repeat(
    ).map(
        map_func=_parse_example_batch,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).prefetch(
        buffer_size=1
    )

With the mapped function:

def _bit_to_float(string_batch: tf.Tensor):
    return tf.reshape(tf.math.floormod(tf.dtypes.cast(tf.bitwise.right_shift(
        tf.expand_dims(tf.io.decode_raw(string_batch, tf.uint8), 2),
        tf.reshape(tf.dtypes.cast(tf.range(7, -1, -1), tf.uint8), (1, 1, 8))
    ), tf.float32), 2), (tf.shape(string_batch)[0], -1))


def _parse_example_batch(example_batch):
    preprocessed_sample_columns = {
        "features": tf.io.VarLenFeature(tf.float32),
        "booleanFeatures": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.float32, -1)
    }
    samples = tf.io.parse_example(example_batch, preprocessed_sample_columns)
    dense_float = tf.sparse.to_dense(samples["features"])
    bits_to_float = _bit_to_float(samples["booleanFeatures"])
    return (
        tf.concat([dense_float, bits_to_float], 1),
        tf.reshape(samples["label"], (-1, 1))
    )
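As a sanity check (not part of the pipeline itself), the bit-unpacking done by `_bit_to_float` is equivalent to NumPy's `np.unpackbits`, which can be used to verify the expected output on a small example:

```python
import numpy as np

# Each byte of the raw string expands into 8 floats, MSB first,
# mirroring the right_shift / floormod logic of _bit_to_float.
raw = np.frombuffer(b"\xa0\x01", dtype=np.uint8)  # 0b10100000, 0b00000001
bits = np.unpackbits(raw).astype(np.float32)
print(bits.tolist())
# [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```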

I tried to follow the best practices of the data pipeline performance guide, and to vectorize my mapped function (as advised by mrry).

With these settings, data is downloaded at high speed (bandwidth is around 200 MB/s), but the CPU is under-used (14%) and the training is very slow (more than one hour per epoch).

I tried several parameter configurations, changing the interleave() arguments (num_parallel_calls, cycle_length) and the TFRecordDataset arguments (num_parallel_reads).

The fastest configuration uses this set of parameters:

  • interleave.num_parallel_calls: 1
  • interleave.cycle_length: 8
  • TFRecordDataset.num_parallel_reads: 8

With this one, an epoch takes only ~20 minutes to run. However, CPU usage is only at 50% while bandwidth consumption is around 55 MB/s.
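For intuition about what cycle_length changes, here is a toy, in-memory sketch (integer datasets stand in for the TFRecord files, so this is purely illustrative): cycle_length controls how many input sources are consumed concurrently, and block_length how many elements are taken from each source per turn.

```python
import tensorflow as tf

# Three toy "files", each yielding its index twice; cycle_length=2 pulls
# from two sources at a time, block_length=1 takes one element per turn.
ds = tf.data.Dataset.range(3).interleave(
    lambda i: tf.data.Dataset.from_tensors(i).repeat(2),
    cycle_length=2,
    block_length=1,
)
print(list(ds.as_numpy_iterator()))  # [0, 1, 0, 1, 2, 2]
```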

Questions:

  1. How can I optimize the pipeline to reach 100% CPU usage (and around 100 MB/s of bandwidth consumption)?
  2. Why does tf.data.experimental.AUTOTUNE not find the best values to speed up the training?

Kind regards, Alexis.


Edit

After some more experimentations, I came to the following solution.

  1. Remove the interleave step, which is already handled by TFRecordDataset if num_parallel_reads is greater than 0.
  2. Update the mapped function to only do parse_example and decode_raw, returning a tuple of the form ((dense_float, bits_to_float), label).
  3. cache after the map
  4. Move the _bit_to_float function as a component of the model

Finally, here is the data pipeline code:

import multiprocessing

import tensorflow as tf


def build_dataset(file_pattern):
    return tf.data.TFRecordDataset(
        tf.data.Dataset.list_files(file_pattern),
        num_parallel_reads=multiprocessing.cpu_count(),
        buffer_size=70*1000*1000
    ).shuffle(
        buffer_size=2048
    ).map(
        map_func=split,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).batch(
        batch_size=2048,
        drop_remainder=True,
    ).cache(
    ).repeat(
    ).prefetch(
        buffer_size=32
    )


def split(example):
    preprocessed_sample_columns = {
        "features": tf.io.VarLenFeature(tf.float32),
        "booleanFeatures": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.float32, -1)
    }
    samples = tf.io.parse_single_example(example, preprocessed_sample_columns)
    dense_float = tf.sparse.to_dense(samples["features"])
    bits_to_float = tf.io.decode_raw(samples["booleanFeatures"], tf.uint8)
    return (
        (dense_float, bits_to_float),
        tf.reshape(samples["label"], (1,))
    )


def build_model(input_shape):
    # N and M are the numbers of float and packed-boolean features,
    # and actual_model is the underlying network (both defined elsewhere).
    feature = keras.Input(shape=(N,))
    bool_feature = keras.Input(shape=(M,), dtype="uint8")
    one_hot = dataset._bit_to_float(bool_feature)
    dense_input = tf.reshape(
        keras.backend.concatenate([feature, one_hot], 1),
        input_shape)
    output = actual_model(dense_input)

    model = keras.Model([feature, bool_feature], output)
    return model

def _bit_to_float(string_batch: tf.Tensor):
    return tf.dtypes.cast(tf.reshape(
        tf.bitwise.bitwise_and(
            tf.bitwise.right_shift(
                tf.expand_dims(string_batch, 2),
                tf.reshape(
                    tf.dtypes.cast(tf.range(7, -1, -1), tf.uint8),
                    (1, 1, 8)
                ),
            ),
            tf.constant(0x01, dtype=tf.uint8)
        ),
        (tf.shape(string_batch)[0], -1)
    ), tf.float32)

Thanks to all these optimizations:

  • Bandwidth consumption is around 90MB/s
  • CPU usage is around 20%
  • The first epoch takes 20 minutes
  • Successive epochs take 5 minutes each

So this seems to be a good first setup. But CPU and bandwidth are still not fully used, so any advice is still welcome!


Edit Bis

So, after some benchmarking, I came across what I think is our best input pipeline:

def build_dataset(file_pattern):
    return tf.data.Dataset.list_files(
        file_pattern
    ).interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.experimental.AUTOTUNE,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        2048
    ).batch(
        batch_size=64,
        drop_remainder=True,
    ).map(
        map_func=parse_examples_batch,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).cache(
    ).prefetch(
        tf.data.experimental.AUTOTUNE
    )

def parse_examples_batch(examples):
    preprocessed_sample_columns = {
        "features": tf.io.FixedLenSequenceFeature((), tf.float32, allow_missing=True),
        "booleanFeatures": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.float32, -1)
    }
    samples = tf.io.parse_example(examples, preprocessed_sample_columns)
    bits_to_float = tf.io.decode_raw(samples["booleanFeatures"], tf.uint8)
    return (
        (samples['features'], bits_to_float),
        tf.expand_dims(samples["label"], 1)
    )

So, what's new:

  • According to this GitHub issue, the TFRecordDataset-internal interleaving is a legacy one, so the interleave function is better.
  • batch before map is a good habit (vectorizing your function) and reduces the number of times the mapped function is called.
  • No need for repeat anymore. Since TF 2.0, the Keras model API supports the dataset API and can use cache (see the SO post).
  • Switching from a VarLenFeature to a FixedLenSequenceFeature removes a useless call to tf.sparse.to_dense.
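The batch-before-map point can be illustrated with a toy in-memory dataset (purely illustrative; a doubling function stands in for parse_examples_batch): a function written over whole batches yields the same elements while being invoked once per batch instead of once per element.

```python
import tensorflow as tf

xs = tf.data.Dataset.range(8)

# Per-element map, then batch: the function runs once per element.
a = xs.map(lambda x: x * 2).batch(4)
# Batch first, then one vectorized call per batch of 4 elements.
b = xs.batch(4).map(lambda x: x * 2)

# Both pipelines produce identical batches.
print([t.tolist() for t in b.as_numpy_iterator()])  # [[0, 2, 4, 6], [8, 10, 12, 14]]
```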

Hope this can help. Advice is still welcome.

AlexisBRENON
  • Thank you for not only asking the right question, but also for providing the answer with it. If I could, I would plus-two. :) EDIT: Actually, I just did sort of - I've upvoted your other answer that refers to this one. :) – Innocent Bystander May 28 '20 at 03:04
  • @InnocentBystander You're welcome ^^ Thanks for the votes, they awarded me some badges too! – AlexisBRENON May 28 '20 at 07:16

2 Answers


Mentioning the solution and the important observations of @AlexisBRENON in the answer section, for the benefit of the community.

Below are the important observations:

  1. According to this GitHub issue, the TFRecordDataset-internal interleaving is a legacy one, so the interleave function is better.
  2. batch before map is a good habit (vectorizing your function) and reduces the number of times the mapped function is called.
  3. No need for repeat anymore. Since TF 2.0, the Keras model API supports the dataset API and can use cache (see the SO post).
  4. Switching from a VarLenFeature to a FixedLenSequenceFeature removes a useless call to tf.sparse.to_dense.

Code for the pipeline, with improved performance, in line with the above observations:

def build_dataset(file_pattern):
    return tf.data.Dataset.list_files(
        file_pattern
    ).interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.experimental.AUTOTUNE,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        2048
    ).batch(
        batch_size=64,
        drop_remainder=True,
    ).map(
        map_func=parse_examples_batch,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).cache(
    ).prefetch(
        tf.data.experimental.AUTOTUNE
    )

def parse_examples_batch(examples):
    preprocessed_sample_columns = {
        "features": tf.io.FixedLenSequenceFeature((), tf.float32, allow_missing=True),
        "booleanFeatures": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.float32, -1)
    }
    samples = tf.io.parse_example(examples, preprocessed_sample_columns)
    bits_to_float = tf.io.decode_raw(samples["booleanFeatures"], tf.uint8)
    return (
        (samples['features'], bits_to_float),
        tf.expand_dims(samples["label"], 1)
    )
Tensorflow Support

I have a further suggestion to add:

According to the documentation of interleave(), you can use a mapping function as its first parameter.

This means, one can write:

dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    lambda x: tf.data.TFRecordDataset(x).map(
        parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE),
    cycle_length=tf.data.experimental.AUTOTUNE,
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)

As I understand it, this maps a parsing function to each shard, and then interleaves the results. This then eliminates the use of dataset.map(...) later on.
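A toy, in-memory version of this pattern (integer "shards" instead of TFRecord files, and a simple arithmetic stand-in for parse_fn, so all names here are illustrative) shows that the per-shard map composes with interleave as described:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Two toy "shards" of three records each; the inner map plays the role
# of parse_fn, tagging each record with its shard index.
shards = tf.data.Dataset.range(2)
ds = shards.interleave(
    lambda i: tf.data.Dataset.range(3).map(
        lambda x: x + 10 * i, num_parallel_calls=AUTOTUNE),
    cycle_length=2,
    num_parallel_calls=AUTOTUNE,
)
print(list(ds.as_numpy_iterator()))  # [0, 10, 1, 11, 2, 12]
```

Note that even with num_parallel_calls set, interleave keeps a deterministic output order by default.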

emil
  • I did not experiment a lot lately. But I don't think that your solution brings much improvement. I think that `interleave` takes care of the IO-blocking behavior (not requiring CPU) while `map` is most of the time CPU-intensive (so it cannot be parallelized much). So, I think your solution is roughly equivalent to `interleave().map()`. To make sure, feel free to experiment with [this](https://www.tensorflow.org/guide/data_performance#reproducing_the_figures), or [this](https://www.tensorflow.org/guide/data_performance). – AlexisBRENON Nov 03 '20 at 07:45