
My goal is to train a neural net for a fixed number of epochs or steps, and I would like each step to use a batch of data of a specific size from a .tfrecords file.

Currently I am reading from the file using this loop:

import numpy as np
import tensorflow as tf

i = 0
data = np.empty(shape=[x, y])  # x = number of examples, y = number of features

for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)

    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here

    data[i] = [Labels[0]]  # more features here

    if i == 3:
        break
    i = i + 1

print(data)  # do some stuff etc.

I am a bit of a Python noob, and I suspect that creating "i" outside the loop and breaking out when it reaches a certain value is just a hacky workaround.

Is there a way to read data from the file by specifying "I would like the first 100 values in the byte_list contained within the Labels feature", and then subsequently "I would like the next 100 values"?

To clarify, the thing that I am unfamiliar with is looping over a file in this manner; I am not really certain how to manipulate the loop.

Thanks.

Charmander35
  • you can try using `enumerate()`: http://stackoverflow.com/questions/522563/accessing-the-index-in-python-for-loops – Shan Carter Jul 19 '16 at 19:21

2 Answers


Impossible. TFRecords is a streaming format, and its reader has no random access.

A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.
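
To see why, it helps to look at the on-disk layout: each record is framed as a length header, a CRC of that length, the payload, and a CRC of the payload, so locating record N requires walking past every record before it. The sketch below is an illustration of that framing only (not an official API) and skips records by reading each length header in turn:

import struct

def skip_records(f, n):
    """Advance a binary file object past n TFRecord records.

    Each record is laid out as:
      uint64 length | uint32 masked CRC of length | data | uint32 masked CRC of data
    """
    for _ in range(n):
        header = f.read(8)
        if len(header) < 8:
            raise EOFError('file contains fewer records than requested')
        length, = struct.unpack('<Q', header)
        f.seek(4 + length + 4, 1)  # skip the length CRC, the payload, and the data CRC

with open(filename, 'rb') as f:
    skip_records(f, 100)  # still O(n): every header before record 100 must be read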

TimZaman
  • Thanks for the answer. I wonder if you could point me somewhere to learn about this kind of format more. Mostly I am just not sure what it might be useful for. – Charmander35 Oct 20 '16 at 06:56
  • Is this only impossible with the available API? Maybe there’s a lower-level solution? – kawingkelvin May 14 '21 at 18:52

Expanding on Shan Carter's comment for archival purposes (although it's not an ideal solution for your question).

If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:

import numpy as np
import tensorflow as tf

n = 5  # iteration you would like to stop at
data = np.empty(shape=[x, y])

for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)

    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here

    data[i] = [Labels[0], Labels[1]]  # more features here

    if i == n:
        break

print(data)
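
If what you want is successive fixed-size chunks ("the first 100 records, then the next 100"), another option is itertools.islice over the record iterator. This is just a sketch using standard-library slicing, not a TFRecords feature: the iterator keeps its position between calls, though each chunk still has to be parsed example by example as above.

import itertools
import tensorflow as tf

record_iter = tf.python_io.tf_record_iterator(filename)

first_chunk = list(itertools.islice(record_iter, 100))  # records 0-99
next_chunk = list(itertools.islice(record_iter, 100))   # records 100-199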

Addressing your use case for .tfrecords

I would like each step to use a batch of data of a specific size from a .tfrecords file.

As mentioned by TimZaman, .tfrecords are not meant for arbitrary access of data. But seeing as you just need to continuously pull batches from the .tfrecords file, you might be better off using the tf.data API to feed your model.

Adapted from the tf.data guide:

Constructing a Dataset from .tfrecord files

import tensorflow as tf

filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames=[filepath1, filepath2])
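
Note that a TFRecordDataset yields raw serialized strings, so you would typically parse and batch it before training. A minimal sketch, assuming a single 'Labels' bytes feature per example (the feature spec here is a guess based on the question's code; adjust it to your actual schema):

# Hypothetical feature spec -- replace with the features your records actually contain.
feature_spec = {'Labels': tf.io.FixedLenFeature([], tf.string)}

def _parse(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_spec)

dataset = dataset.map(_parse).batch(100)  # 100 examples per training step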

From here, if you're using the tf.keras API, you could pass dataset as an argument into model.fit like so:

model.fit(x = dataset,
          batch_size = None,
          validation_data = some_other_dataset)
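
Note that batch_size is left as None here: when x is a tf.data.Dataset, Keras expects the dataset itself to already be batched (e.g. with dataset.batch(...) as above) and will complain if batch_size is also set.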

Extra Stuff

Here's a blog that helps to explain .tfrecord files a little better than the TensorFlow documentation.

MoltenMuffins