I have been experimenting with data augmentation in Keras lately, using the basic ImageDataGenerator. I learned the hard way that it is actually a generator, not an iterator (because type(train_aug_ds)
gives <class 'keras.preprocessing.image.DirectoryIterator'>,
I thought it was an iterator). I also checked a few blog posts about using it, but they don't answer all my questions.
So, I loaded my data like this:
train_aug = ImageDataGenerator(
rescale=1./255,
horizontal_flip=True,
height_shift_range=0.1,
width_shift_range=0.1,
brightness_range=(0.5,1.5),
zoom_range = [1, 1.5],
)
train_aug_ds = train_aug.flow_from_directory(
directory='./train',
target_size=image_size,
batch_size=batch_size,
)
And to train my model I did the following:
model.fit(
train_aug_ds,
epochs=150,
validation_data=(valid_aug_ds,),
)
And it worked. I am a bit confused about how it works, because train_aug_ds
is a generator, so it should yield an infinitely large dataset. And the documentation says:
When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument.
Which I didn't do, yet it works. Does it somehow infer the number of steps? Also, does it use only augmented data, or does it also include non-augmented images in a batch?
So basically, my question is how to use this generator correctly with the fit
function so that training covers all the data in my training set, including both the original, non-augmented images and the augmented ones, and cycles through it several times/steps (right now it seems to make only one pass per epoch)?
Answer:
I think the documentation can be quite confusing, and I imagine the behavior differs depending on your TensorFlow and Keras versions. For example, in this post, the user describes exactly the behavior you are expecting. Generally, the flow_from_directory()
method lets you read images directly from a directory and augment them while your model is being trained and, as already stated here, it iterates over every sample in each folder every epoch. Using the following example, you can check that this is the case (on TF 2.7) by looking at the steps per epoch in the progress bar:
import tensorflow as tf
BATCH_SIZE = 64
flowers = tf.keras.utils.get_file(
'flower_photos',
'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
untar=True)
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(
rescale=1./255,
horizontal_flip=True,
)
train_ds = img_gen.flow_from_directory(flowers, batch_size=BATCH_SIZE, shuffle=True, class_mode='sparse')
num_classes = 5
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu', input_shape=(256, 256, 3)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(num_classes)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
epochs=10
history = model.fit(
train_ds,
epochs=epochs
)
Found 3670 images belonging to 5 classes.
Epoch 1/10
6/58 [==>...........................] - ETA: 3:02 - loss: 2.0608
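The 58 in the progress bar matches the number of batches needed to cover 3670 images at a batch size of 64, with the last batch being smaller than the rest. A minimal sketch of that arithmetic in plain Python (the helper name steps_for_one_pass is made up for illustration and is not part of Keras):

```python
import math


def steps_for_one_pass(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


num_samples = 3670   # "Found 3670 images" in the output above
batch_size = 64      # BATCH_SIZE in the example

steps = steps_for_one_pass(num_samples, batch_size)
# Size of the final, partial batch
last_batch = num_samples - (steps - 1) * batch_size

print(steps, last_batch)
```

Note that integer division (3670 // 64) would give 57 and drop the final partial batch, which is why the count is rounded up.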
If you wrap flow_from_directory
with tf.data.Dataset.from_generator
like this:
train_ds = tf.data.Dataset.from_generator(
lambda: img_gen.flow_from_directory(flowers, batch_size=BATCH_SIZE, shuffle=True, class_mode='sparse'),
output_types=(tf.float32, tf.float32))
You will notice that the progress bar now shows an unknown total, because the wrapped dataset has no known length and steps_per_epoch
has not been explicitly defined:
Epoch 1/10
Found 3670 images belonging to 5 classes.
29/Unknown - 104s 4s/step - loss: 2.0364
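If you then pass steps_per_epoch explicitly, the progress bar shows the total number of steps again:
# Build the iterator once so its length (batches per pass) can be queried;
# len() is ceil(3670 / 64) = 58 here.
from_directory = img_gen.flow_from_directory(flowers, batch_size=BATCH_SIZE, shuffle=True, class_mode='sparse')
history = model.fit(
train_ds,
steps_per_epoch=len(from_directory),
epochs=epochs
)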
Found 3670 images belonging to 5 classes.
Epoch 1/10
3/58 [>.............................] - ETA: 3:19 - loss: 4.1357
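The difference between the two progress bars seems to come down to whether the object handed to fit exposes a length. As a rough illustration in plain Python (ToyBatches, endless and known_steps are hypothetical names, not Keras APIs), an object with __len__ can report its number of batches, while a generator built on top of it cannot:

```python
import math


class ToyBatches:
    """Hypothetical stand-in for a length-aware batch iterator."""

    def __init__(self, num_samples, batch_size):
        self.num_samples = num_samples
        self.batch_size = batch_size

    def __len__(self):
        # Batches per pass, counting the final partial batch
        return math.ceil(self.num_samples / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = min(start + self.batch_size, self.num_samples)
        return list(range(start, stop))  # sample indices in this batch


def endless(batches):
    """Wrap a length-aware object in a generator that never stops."""
    while True:
        for i in range(len(batches)):
            yield batches[i]


def known_steps(obj):
    """Return len(obj) if the object has a length, else None."""
    try:
        return len(obj)
    except TypeError:
        return None


toy = ToyBatches(3670, 64)
steps_direct = known_steps(toy)            # length is available
steps_wrapped = known_steps(endless(toy))  # generators have no len()
print(steps_direct, steps_wrapped)
```

Once the length is hidden behind a generator, the number of steps has to be supplied from outside, which is the role steps_per_epoch plays above.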
Finally, to your question:
How to use this generator correctly with function fit to have all data in my training set, including original, non-augmented images and augmented images, and to cycle through it several times/step?
You can simply increase steps_per_epoch
beyond ceil(number of samples / batch_size), the number of batches in one pass,
by multiplying it by some factor:
# from_directory is the DirectoryIterator returned by img_gen.flow_from_directory(...)
history = model.fit(
train_ds,
steps_per_epoch=len(from_directory) * 2,
epochs=epochs
)
Found 3670 images belonging to 5 classes.
Epoch 1/10
1/116 [..............................] - ETA: 12:11 - loss: 1.5885
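Now instead of 58 steps per epoch you have 116, so each epoch makes two passes over the 3670 images. Because the transformations are sampled randomly for every batch, the two passes generally see different variants of the same image.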
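To make the effect of doubling the steps concrete, here is a toy sketch in plain Python, not Keras itself. It assumes an iterator that loops forever and flips each image horizontally with probability 0.5, so some images come through unchanged; endless_batches and the tiny 2x2 "images" are made up for illustration:

```python
import itertools
import math
import random


def endless_batches(images, batch_size, rng):
    """Toy augmenting iterator: reshuffles and loops forever, flipping at random."""
    while True:
        order = list(range(len(images)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = []
            for i in order[start:start + batch_size]:
                img = images[i]
                if rng.random() < 0.5:
                    # Horizontal flip: reverse every row
                    img = [row[::-1] for row in img]
                batch.append(img)
            yield batch


rng = random.Random(0)
# Ten tiny 2x2 "images" with distinct pixel values
images = [[[4 * k, 4 * k + 1], [4 * k + 2, 4 * k + 3]] for k in range(10)]
batch_size = 4

steps_per_pass = math.ceil(len(images) / batch_size)
# Taking twice the steps of one pass walks the data twice
batches = list(itertools.islice(endless_batches(images, batch_size, rng), steps_per_pass * 2))
samples_seen = sum(len(b) for b in batches)
print(steps_per_pass, len(batches), samples_seen)
```

In this sketch every image is drawn twice per "epoch", each time with an independent coin flip deciding whether it is flipped.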