I use the image_dataset_from_directory() function from Keras to create an image dataset.
ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    color_mode='grayscale',
    image_size=(img_height, img_width),
    seed=42,
    batch_size=batch_size,
    label_mode='binary'
)
I set a fixed seed, so if I execute this function several times, the dataset is shuffled the same way.
However, when I predict on this dataset with print(model.predict(ds)), the output is different every time.
It seems the dataset is shuffled again, because when I print the images like this:
for x, y in ds:
    print(x)
the output is also different each time.
What am I missing?
CodePudding user response:
The function image_dataset_from_directory uses the tf.data.Dataset API. The default behaviour of tf.data.Dataset.shuffle is to reshuffle the dataset at each iteration. From the documentation of shuffle:
dataset = tf.data.Dataset.range(3)
dataset = dataset.shuffle(3, reshuffle_each_iteration=True)
list(dataset.as_numpy_iterator())
# [1, 0, 2]
list(dataset.as_numpy_iterator())
# [1, 2, 0]
If you want to shuffle your dataset but keep exactly the same order on every iteration over it, disable shuffling when creating the dataset, then shuffle it afterwards with reshuffle_each_iteration=False.
ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    color_mode='grayscale',
    image_size=(img_height, img_width),
    seed=42,
    batch_size=batch_size,
    label_mode='binary',
    shuffle=False
)
# The default buffer in image_dataset_from_directory is 8*batch_size.
# Note: ds is already batched here, so this shuffles whole batches, not individual images.
ds = ds.shuffle(buffer_size=8*batch_size, seed=42, reshuffle_each_iteration=False)
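To check this behaviour without any image files, here is a minimal sketch that uses tf.data.Dataset.range as a stand-in for the image dataset. The helper name two_passes and the dataset size of 10 are only illustrative assumptions.

```python
import tensorflow as tf


def two_passes(reshuffle):
    """Iterate a small shuffled dataset twice and return both orders."""
    # Dataset.range stands in for the image dataset so no files are needed.
    ds = tf.data.Dataset.range(10)
    ds = ds.shuffle(buffer_size=10, seed=42,
                    reshuffle_each_iteration=reshuffle)
    first = [int(v) for v in ds.as_numpy_iterator()]
    second = [int(v) for v in ds.as_numpy_iterator()]
    return first, second


# With reshuffle_each_iteration=False both passes see the same order.
fixed_first, fixed_second = two_passes(reshuffle=False)
print(fixed_first == fixed_second)

# With reshuffle_each_iteration=True the order is drawn again on each pass.
moving_first, moving_second = two_passes(reshuffle=True)
print(moving_first == moving_second)
```

If the first comparison prints True, then model.predict(ds) should see the samples in the same order on every call.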
Regarding the seed: it makes the shuffling reproducible across executions of the program, not across iterations over the dataset within one execution. You can read more about seeds and TensorFlow's random behaviour in the documentation of tf.random.set_seed.
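The difference between "same at each execution" and "same at each iteration" can be illustrated with a toy model in plain Python. This is only a conceptual sketch using the standard random module, not how TensorFlow implements shuffling; run_program is a hypothetical helper that stands for one execution of your script.

```python
import random


def run_program(seed, passes=2):
    """Simulate one program execution that iterates a reshuffling dataset."""
    rng = random.Random(seed)  # seeded once per execution
    orders = []
    for _ in range(passes):
        order = list(range(10))
        rng.shuffle(order)  # each pass continues the same random stream
        orders.append(order)
    return orders


first_execution = run_program(seed=42)
second_execution = run_program(seed=42)

# Two executions with the same seed produce the same sequence of orders.
print(first_execution == second_execution)  # True

# Within one execution, the second pass is shuffled differently from the first.
print(first_execution[0] == first_execution[1])
```

The seed pins down the whole sequence of shuffles, so rerunning the script reproduces it, but consecutive passes inside one run still differ.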