I am creating some batched TensorFlow datasets using tf.keras.preprocessing.image_dataset_from_directory:
import os
import tensorflow as tf

# model_split_dir (defined elsewhere) contains the 'train' and 'test' folders
image_size = (90, 120)
batch_size = 32

# 75% of the 'train' folder is used for training
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'train'),
    validation_split=0.25,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
# the remaining 25% is used for validation (same seed, same split)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'train'),
    validation_split=0.25,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'test'),
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
If I then use the following for loop to get image and label information from one of the datasets, I get different outputs each time I run it:
for images, labels in test_ds:
    print(labels)
For instance, the first batch will appear like this in one run:
tf.Tensor([0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1], shape=(32,), dtype=int32)
But then it is completely different when the loop is run again:
tf.Tensor([1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0], shape=(32,), dtype=int32)
How can the order be different every time I loop over it? Are TensorFlow datasets unordered? From what I've found, they are supposed to be ordered, so I have no idea why the for loop returns the labels in different orders each time.
Any insight regarding this would be much appreciated.
UPDATE: The shuffling of the order of the dataset is working as intended. For my test data, I just need to set shuffle to False. Many thanks @AloneTogether !
Answer:
The shuffle parameter of tf.keras.preprocessing.image_dataset_from_directory is set to True by default. If you want deterministic results, try setting it to False:
import tensorflow as tf
import pathlib

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    image_size=(28, 28),
    batch_size=5,
    shuffle=False)

for x, y in train_ds:
    print(y)
    break
This, on the other hand, gives a different order on every run, because shuffling is on and no seed is set:
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=None,
    image_size=(28, 28),
    batch_size=5,
    shuffle=True)

for x, y in train_ds:
    print(y)
    break
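A seed does not stop a second loop over the same dataset object from coming out in a different order, as the question shows with seed=1. What a seed together with shuffle=True gives you is reproducibility: the dataset is still shuffled, but a fresh run of the script gives the same result: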
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=123,
    image_size=(28, 28),
    batch_size=5,
    shuffle=True)

for x, y in train_ds:
    print(y)
    break
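To see the difference between "seeded" and "same order on every loop" without downloading any images, here is a minimal sketch on a toy tf.data.Dataset of the integers 0 to 9. It assumes that image_dataset_from_directory shuffles through the ordinary tf.data shuffle mechanism, which is my understanding rather than something stated in this thread. The names one_pass, numbers, fixed and reshuffled are made up for the illustration.

```python
import tensorflow as tf


def one_pass(ds):
    """Collect every element of one full iteration as plain Python ints."""
    return [int(x) for x in ds.as_numpy_iterator()]


# Toy stand-in for an image dataset: the integers 0..9.
numbers = tf.data.Dataset.range(10)

# No shuffling: elements come back in their original order on every pass.
unshuffled = one_pass(numbers)

# Seeded shuffle with reshuffling disabled: the order is drawn once and
# reused for every later pass over the same dataset object.
fixed = numbers.shuffle(buffer_size=10, seed=1, reshuffle_each_iteration=False)
first_pass = one_pass(fixed)
second_pass = one_pass(fixed)

# Seeded shuffle with reshuffling enabled: each pass may use a new order,
# even though a seed is set.
reshuffled = numbers.shuffle(buffer_size=10, seed=1, reshuffle_each_iteration=True)
pass_a = one_pass(reshuffled)
pass_b = one_pass(reshuffled)

print("unshuffled:          ", unshuffled)
print("fixed, pass 1:       ", first_pass)
print("fixed, pass 2:       ", second_pass)
print("reshuffled, pass 1:  ", pass_a)
print("reshuffled, pass 2:  ", pass_b)
```

With reshuffle_each_iteration=False the two passes print the same order, while with reshuffling enabled they will typically differ. For a test set, shuffle=False as in the update above is still the simplest choice.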