43

Let's say I have defined a dataset in this way:

filename_dataset = tf.data.Dataset.list_files("{}/*.png".format(dataset))

How can I get the number of elements that are inside the dataset (that is, the number of single elements that compose an epoch)?

I know that tf.data.Dataset already knows the size of the dataset, because the repeat() method allows repeating the input pipeline for a specified number of epochs. So there must be a way to get this information.

denfromufa
nessuno
  • Do you need to have this information *before* the first epoch completed, or is it okay to compute it after? – P-Gn Jun 07 '18 at 09:21
  • Before the first epoch completed – nessuno Jun 07 '18 at 09:22
  • Working as an `iterator`, I don't think a `Dataset` knows the total number of elements before reaching the last one - then it starts repeating over if requested (cf. source [repeat_dataset_op.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/repeat_dataset_op.cc#L125)) – benjaminplanche Jun 07 '18 at 09:29
  • Can't you just list the files in `"{}/*.png".format(dataset)` before (say via `glob` or `os.listdir`), get the length of that and then pass the list to a Dataset? Datasets don't have (natively) access to the number of items they contain (knowing that number would require a full pass on the dataset, and you still have the case of unlimited datasets coming from streaming data or generators) – GPhilo Jun 07 '18 at 09:39
  • @GPhilo I could do that only in this particular case, but I'd like to have a more general solution. – nessuno Jun 07 '18 at 09:45
  • @user1735003 thank you for your answer, I'm gonna test it soon. Can you please also add the option to get the size after the end of the first epoch? – nessuno Jun 07 '18 at 09:46
  • @nessuno the thing is, there is no general solution, because Datasets don't know their size. If you have TFRecord datasets, for example, there is no way for you to know *at creation time* how many samples your dataset contains. The only way is to count them as you go, or do a full pass of the dataset before you start training (which, depending on your dataset's size, can be quite slow) – GPhilo Jun 07 '18 at 10:01
  • @GPhilo understood, thank you for the explanation! However the answer of user1735003 perfectly fits my needs – nessuno Jun 07 '18 at 10:03
  • From what I can see in official `tf` tutorials - they count *files* before creating a dataset, not the number of elements in a dataset. – irudyak Jul 22 '20 at 09:14
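As GPhilo suggests above, when the pipeline is built from files you can count them yourself before building the dataset. A minimal sketch of that idea (the directory path below is a hypothetical stand-in for the `dataset` variable from the question):

import glob
import tensorflow as tf

dataset_dir = "/path/to/images"  # hypothetical directory, i.e. the `dataset` variable from the question
file_list = glob.glob("{}/*.png".format(dataset_dir))
num_elements = len(file_list)  # known before the first epoch starts

# list_files shuffles by default; add an explicit shuffle() if you need that behaviour
filename_dataset = tf.data.Dataset.from_tensor_slices(file_list)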

15 Answers

35

`len(list(dataset))` works in eager mode, although that's obviously not a good general solution.

nbro
markemus
  • It defeats the purpose of it being an iterator. Calling list() runs the entire thing in a single shot. It works for smaller amounts of data, but can likely take too many resources for larger datasets. – yrekkehs Jan 06 '20 at 10:44
  • @yrekkehs absolutely, that's why it's not a good general solution. But it works. – markemus Jan 06 '20 at 23:52
  • @markemus Didn't mean to sound contentious, I was just trying to answer PhonoDots. :) – yrekkehs Jan 07 '20 at 08:47
  • @yrekkehs gotcha, and I agree :) – markemus Jan 07 '20 at 19:52
19

Take a look here: https://github.com/tensorflow/tensorflow/issues/26966

It doesn't work for TFRecord datasets, but it works fine for other types.

TL;DR:

num_elements = tf.data.experimental.cardinality(dataset).numpy()
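A short sketch of that pattern on a toy dataset, falling back to counting by iteration when the cardinality is unknown (as it is for TFRecord datasets):

import tensorflow as tf

# toy stand-in for the dataset from the question
dataset = tf.data.Dataset.from_tensor_slices(["a.png", "b.png", "c.png"])

num_elements = tf.data.experimental.cardinality(dataset)
if num_elements == tf.data.experimental.UNKNOWN_CARDINALITY:
    # e.g. TFRecord datasets: the size is not known statically, so count by iterating
    num_elements = dataset.reduce(0, lambda x, _: x + 1)
print(int(num_elements))  # 3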

8

Unfortunately, I don't believe there is a feature like that in TF yet. With TF 2.0 and eager execution, however, you can just iterate over the dataset:

num_elements = 0
for element in dataset:
    num_elements += 1

This is the most storage-efficient way I could come up with.

This really feels like a feature that should have been added a long time ago. Fingers crossed they add a length feature in a later version.

RodYt
  • Alternatively, a more concise way to add things up in TF 2.0: `count = dataset.reduce(0, lambda x, _: x + 1)` – Happy Gene Oct 28 '19 at 21:56
  • I found you have to call numpy() on count to get the actual value, otherwise count is a tensor, i.e. `count = dataset.reduce(0, lambda x, _: x + 1).numpy()` – CSharp Nov 25 '19 at 09:31
8

UPDATE:

Use tf.data.experimental.cardinality(dataset) - see here.


In the case of tensorflow_datasets you can use `_, info = tfds.load(with_info=True)` and then call `info.splits['train'].num_examples`. But even in this case it doesn't work properly if you define your own split.
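For instance (with 'mnist' used purely as an illustrative dataset name):

import tensorflow_datasets as tfds

ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.splits['train'].num_examples)  # 60000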

So you may either count your files or iterate over the dataset (as described in other answers):

num_training_examples = 0
num_validation_examples = 0

for example in training_set:
    num_training_examples += 1

for example in validation_set:
    num_validation_examples += 1
Eypros
irudyak
7

tf.data.Dataset.list_files creates a tensor called MatchingFiles:0 (with the appropriate prefix if applicable).

You could evaluate

tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

to get the number of files.

Of course, this would work in simple cases only, and in particular if you have only one sample (or a known number of samples) per image.

In more complex situations, e.g. when you do not know the number of samples in each file, you can only observe the number of samples as an epoch ends.

To do this, you can watch the number of epochs counted by your Dataset: repeat() creates a member called _count that counts the number of epochs. By observing it during your iterations, you can spot when it changes and compute your dataset size from there.

This counter may be buried in the hierarchy of Datasets that is created when calling member functions successively, so we have to dig it out like this:

import warnings

d = my_dataset
# RepeatDataset seems not to be exposed -- this is a possible workaround
RepeatDataset = type(tf.data.Dataset().repeat())
try:
  while not isinstance(d, RepeatDataset):
    d = d._input_dataset
except AttributeError:
  warnings.warn('no epoch counter found')
  epoch_counter = None
else:
  epoch_counter = d._count

Note that with this technique, the computation of your dataset size is not exact, because the batch during which epoch_counter is incremented typically mixes samples from two successive epochs. So this computation is precise up to your batch length.

P-Gn
5

You can use this for TFRecords in TF2:

ds = tf.data.TFRecordDataset(dataset_filenames)
ds_size = sum(1 for _ in ds)
David Bacelj
4

As of TensorFlow >= 2.3 you can use:

 print(dataset.cardinality().numpy())

Note that the .cardinality() method was integrated into the main package (previously it was in the experimental package).

Note that after applying a filter() operation, cardinality() can return -2 (unknown cardinality), since TensorFlow cannot determine the dataset's size statically.
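A small sketch of the non-experimental API and its sentinel values (assuming TF >= 2.3):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(ds.cardinality().numpy())  # 10

# repeat() without arguments makes the cardinality infinite (-1),
# filter() makes it unknown (-2), since TF cannot infer the size statically
print(bool(ds.repeat().cardinality() == tf.data.INFINITE_CARDINALITY))                # True
print(bool(ds.filter(lambda x: x > 3).cardinality() == tf.data.UNKNOWN_CARDINALITY))  # True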

Timbus Calin
3

In TF 2.0, I do it like this:

num = 0  # in case the dataset is empty
for num, _ in enumerate(dataset, start=1):  # start=1 so num ends up equal to the element count
    pass

print(f'Number of elements: {num}')
Amal Roy
3

This has worked for me:

length_dataset = dataset.reduce(0, lambda x, _: x + 1).numpy()

It iterates over your dataset, incrementing the accumulator x, which is returned as the length of the dataset.

Lukas
1

For some datasets, like COCO, the cardinality function does not return a size. One way to compute the size of a dataset quickly is to use map/reduce, like so:

ds.map(lambda x: 1, num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(tf.constant(0), lambda x,_: x+1)
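A runnable variant of that one-liner on a toy dataset; `lambda *_: 1` is used instead of `lambda x: 1` so the map also works when elements are tuples (map unpacks multi-component elements into separate arguments):

import tensorflow as tf

# toy (image, label) dataset standing in for something like COCO
ds = tf.data.Dataset.from_tensor_slices((tf.zeros([100, 4, 4, 3]), tf.zeros([100])))

size = ds.map(lambda *_: 1,
              num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(
                  tf.constant(0), lambda x, _: x + 1)
print(int(size))  # 100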
dgumo
0

A bit late to the party, but for a large dataset stored in TFRecord files I used this (TF 1.15):

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
dataset = tf.data.TFRecordDataset('some_path')
# Count in large batches rather than record by record
n = 0
batch_size = 200000
for batch in dataset.batch(batch_size):
  n += int(batch.shape[0])  # the last batch may be smaller than batch_size
  print(n)
bart
0

Let's say you want to find out the number of examples in the training split of the oxford-iiit-pet dataset:

ds, info = tfds.load('oxford_iiit_pet', split='train', shuffle_files=True, as_supervised=True, with_info=True)

print(info.splits['train'].num_examples)

  • I think your solution is incorrect. The return object, **ds**, is not the same as what **split['train']** represents. You can see what I mean by this: `(train, val), info = tfds.load('oxford_iiit_pet', split=['train[:70%]','train[70%:]'], shuffle_files=True, as_supervised=True)`. The sizes of the subdatasets **train** and **val** change as we modify the percentage specified in the **split=** argument. However, `info.splits['train'].num_examples` is fixed at 3680. – Li-Pin Juan Feb 19 '21 at 16:23
0

You can do it in TensorFlow 2.4.0 with just `len(filename_dataset)`.

  • Hi, I think you are wrong. `len()` is not applicable to a `tf.data.Dataset` object. Based on the discussion in this [thread](https://github.com/tensorflow/tensorflow/issues/26966), it's unlikely to have this feature in the near future. – Li-Pin Juan Feb 16 '21 at 04:19
  • Hey, I would not describe it as not applicable. I had a dataset of 391 images and it returned exactly that. – alzoubi36 Feb 20 '21 at 06:05
  • I know it works in some cases, but generally it doesn't. `len()` cannot be applied to a Dataset object like `tfds.load('tf_flowers')['train'].repeat()`, for example, because its size is infinite. – Li-Pin Juan Feb 21 '21 at 00:36
0

As of version 2.5.0, you can simply call `print(dataset.cardinality())` to see the length and type of the dataset.

VahidG
0

I am very surprised that this problem does not have an explicit solution, because it is such a simple feature. When I iterate over the dataset with tqdm, I find that tqdm figures out the data size. How does this work?

from tqdm import tqdm

for x in tqdm(ds['train']):
    pass  # something

->  1%|          | 15643/1281167 [00:16<07:06, 2964.90it/s]

t = tqdm(ds['train'])
t.total
-> 1281167
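This most likely works because tqdm calls len() on its iterable when no explicit total is given, and tf.data.Dataset implements __len__ (TF >= 2.3) whenever the cardinality is statically known, which is the case for tfds-provided splits. A minimal sketch with a toy dataset:

import tensorflow as tf
from tqdm import tqdm

ds = tf.data.Dataset.range(1281167)  # toy dataset with a statically known cardinality

t = tqdm(ds)    # tqdm picks up the total via len(ds)
print(t.total)  # 1281167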
krenerd