
Suppose I have something like the following:

from keras.preprocessing.image import ImageDataGenerator

image_data_generator = ImageDataGenerator(rescale=1./255)

train_generator = image_data_generator.flow_from_directory(
  'my_directory',
  target_size=(28, 28),
  batch_size=32,
  class_mode='categorical'
)

Then my train_generator is filled with data from my_directory, which contains two subfolders that split the data into classes 0 and 1.

Suppose I also have another directory, that_directory, with data likewise split into classes 0 and 1. I want to augment my train_generator with this additional data.

Running train_generator = image_data_generator.flow_from_directory('that_directory', ...) simply overwrites train_generator, so the data from my_directory is no longer included.

Is there a way to augment or append both sets of data into one generator or an object that operates like a DirectoryIterator without changing the folder structure itself?

1 Answer


Just combine the generators in another generator, optionally with different augmentation configs:

idg1 = ImageDataGenerator(**idg1_configs)
idg2 = ImageDataGenerator(**idg2_configs)

g1 = idg1.flow_from_directory('idg1_dir',...)
g2 = idg2.flow_from_directory('idg2_dir',...)

def combine_gen(*gens):
    while True:
        for g in gens:
            yield next(g)

# ...
model.fit_generator(combine_gen(g1, g2), steps_per_epoch=len(g1)+len(g2), ...)

This would alternately generate batches from g1 and g2.
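For example, here is a quick sanity check of that round-robin order using plain Python generators as hypothetical stand-ins for the Keras iterators (fake_gen and its strings are purely illustrative):

# Hypothetical stand-ins for g1 and g2, just to show the alternation order
def fake_gen(name):
    while True:
        yield name

demo = combine_gen(fake_gen('batch from g1'), fake_gen('batch from g2'))
print(next(demo))  # batch from g1
print(next(demo))  # batch from g2
print(next(demo))  # batch from g1 again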

Note that one might suggest using itertools.chain; however, that won't work here, since ImageDataGenerator generators are never-ending and ceaselessly generate batches of data. That is exactly what is expected of the generator you pass to the fit_generator method. From the Keras documentation:

...The generator is expected to loop over its data indefinitely. An epoch finishes when steps_per_epoch batches have been seen by the model.

If steps_per_epoch is not set, it defaults to len(generator), where generator is the generator you pass to the fit_generator method. ImageDataGenerator generators can report their length, so you don't need to set the steps_per_epoch argument manually. If you would like the same behavior with the combined generators above, you can use this solution instead:

class CombinedGen():
    def __init__(self, *gens):
        self.gens = gens

    def generate(self):
        while True:
            for g in self.gens:
                yield next(g)

    def __len__(self):
        return sum([len(g) for g in self.gens])

# usage:
cg = CombinedGen(g1, g2)
model.fit_generator(cg.generate(), ...) # no need to set `steps_per_epoch`

You can also add __next__ and/or __iter__ methods to the CombinedGen class if you want to iterate directly over objects of this class (instead of iterating over cg.generate()).
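For instance, a minimal sketch of what those additions could look like (the self._gen attribute is an extra helper introduced here, not part of the class above):

class CombinedGen:
    def __init__(self, *gens):
        self.gens = gens
        self._gen = self.generate()  # backing generator for __next__

    def generate(self):
        while True:
            for g in self.gens:
                yield next(g)

    def __len__(self):
        return sum(len(g) for g in self.gens)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._gen)

With that in place, for (data, labels) in cg: ... works directly; remember that the loop never ends on its own, so you still need to break out of it after the desired number of steps.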

  • How would this work if I am doing something like `for (data, labels) in my_directory_iterator`? It doesn't seem to me that combine_gen would have the nice iterator properties, since it perpetually yields. – Richard Jul 25 '19 at 15:56
  • @Richard "...since it perpetually yields" as I said in the last sentence this is how `ImageDataGenerator` works: it just never-endingly generates data. So `combine_gen` is not different. Further, if you carefully read the code you realize it has just wrapped the generators (i.e. it's a wrapper generator), so you can surely do something like `for (data, labels) in combine_gen(*my_generators)`: it would behave the same way (and never stops; you are responsible to stop it somehow with e.g. counting the steps). – today Jul 25 '19 at 15:59
  • @Richard The limiting is done using `steps_per_epoch` argument of `fit_generator` for `ImageDataGenerator` generators, and you can also use that with `combine_gen()` generators. – today Jul 25 '19 at 16:02
  • @Richard I just updated my answer with a solution where you don't need to manually set the `steps_per_epoch` argument. Please take a look. – today Jul 25 '19 at 16:20
  • It turns out that I'm unable to iterate over this because the wrapper doesn't have an `__iter__` attribute and doesn't magically inherit it from the underlying `DirectoryIterator`s. It also seems tricky since `cg.generate()` returns generator objects in Python, which I'm not immediately sure how to resolve. This basically got me 90% of the way there, though, and I think it'll work with more patching and more Googling. Thanks! – Richard Aug 02 '19 at 18:26
  • @Richard Why don't you iterate over `cg.generate()` instead? It's just a generator. Further, you can easily add `__next__` and/or `__iter__` methods to the `CombinedGen` class if you want to directly iterate over the objects of that class. – today Aug 02 '19 at 19:00
  • I did end up adding `__next__` and `__iter__` methods to the class but I don't really know how to "resolve" the generator object which is returned. I think I don't really understand how generators work. Do you have a good reference for this? Quick edit: iterating over `cg.generate()` did indeed work, I think I just wrote really busted `__next__` and `__iter__` methods. I'd still be interested in the reference if you could refer me! – Richard Aug 02 '19 at 19:02
  • @Richard Why don't we get help from SO itself?! Here: [1](https://stackoverflow.com/q/1756096/2099607), [2](https://stackoverflow.com/q/231767/2099607), [3](https://stackoverflow.com/q/40255096/2099607), [4](https://stackoverflow.com/q/2776829/2099607), [5](https://stackoverflow.com/q/52056146/2099607). Let me know if anything is still unclear. – today Aug 02 '19 at 19:08
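To make the step counting mentioned in the comments concrete, here is a minimal sketch of a manual loop over the combined generator (the break condition is just one way to stop the otherwise endless iteration):

steps = len(g1) + len(g2)  # roughly one pass over both directories
for step, (data, labels) in enumerate(combine_gen(g1, g2)):
    if step >= steps:
        break
    # ... use data and labels here, e.g. in a manual training or evaluation loop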