TensorFlow Dataset: Order appears randomised when iterating via for loop?

I am creating some batched TensorFlow datasets using tf.keras.preprocessing.image_dataset_from_directory:

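import os
import tensorflow as tf

# model_split_dir (defined elsewhere) is the folder holding the 'train' and 'test' subfolders
image_size = (90, 120)
batch_size = 32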

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'test'),
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)

If I then use the following for loop to get image and label information from one of the datasets, I get different outputs each time I run it:

for images, labels in test_ds:
  print(labels)

For instance, the first batch will appear like this in one run:

tf.Tensor([0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1], shape=(32,), dtype=int32)

But then it is completely different when the loop is run again:

tf.Tensor([1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0], shape=(32,), dtype=int32)

How can the order be different every time I loop over it? Are TensorFlow datasets unordered? From what I've found, they are supposed to be ordered, so I have no idea why the for loop returns the labels in different orders each time.

Any insight regarding this would be much appreciated.

UPDATE: The shuffling of the order of the dataset is working as intended. For my test data, I just need to set shuffle to False. Many thanks @AloneTogether !
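
For anyone who wants to check this on their own setup, here is a minimal sketch of the shuffle=False case. It does not use my real data: the temporary directory, the class folder names (class_a, class_b) and the tiny solid-colour images are all made up for the example. It builds the dataset with shuffle=False and loops over it twice to compare the label order.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Build a tiny throwaway image directory: two classes, four images each.
# Folder names and pixel values are hypothetical, chosen only for this sketch.
root = tempfile.mkdtemp()
for class_name in ("class_a", "class_b"):
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir)
    for i in range(4):
        pixels = np.full((8, 8, 3), 40 * i, dtype=np.uint8)
        tf.io.write_file(
            os.path.join(class_dir, f"img_{i}.png"),
            tf.io.encode_png(pixels),
        )

test_ds = tf.keras.utils.image_dataset_from_directory(
    root,
    image_size=(8, 8),
    batch_size=4,
    shuffle=False,  # no shuffling, so every pass should use the same order
)


def collect_labels(ds):
    """Concatenate the label batches from one full pass over the dataset."""
    return np.concatenate([labels.numpy() for _, labels in ds])


first_pass = collect_labels(test_ds)
second_pass = collect_labels(test_ds)
print(first_pass)
print(second_pass)
print("same order on both passes:", np.array_equal(first_pass, second_pass))
```

A fixed order like this is what you want for a test set if you later line up predictions against labels or file names.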

CodePudding user response:

The shuffle parameter of tf.keras.preprocessing.image_dataset_from_directory is set to True by default. If you want deterministic results, try setting it to False:

import tensorflow as tf
import pathlib

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  image_size=(28, 28),
  batch_size=5,
  shuffle=False)

for x, y in train_ds:
  print(y)
  break

This, on the other hand, yields a different order on every run, because shuffle=True and no seed is set:

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=None,
  image_size=(28, 28),
  batch_size=5,
  shuffle=True)

for x, y in train_ds:
  print(y)
  break

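If you set a random seed and shuffle=True, the data is still shuffled, but the shuffle is reproducible: re-running the script gives the same first batch. Within a single session, a second pass over the dataset can still come out in a different order, since the shuffle is re-applied on each iteration: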

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=123,
  image_size=(28, 28),
  batch_size=5,
  shuffle=True)

for x, y in train_ds:
  print(y)
  break
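
To see why a seeded dataset can still change order between two loops in the same session, here is a small sketch using plain tf.data rather than image files. My understanding is that image_dataset_from_directory applies a tf.data shuffle internally when shuffle=True; that is an assumption about its internals, not something shown above. The range dataset and the helper one_pass are made up for illustration. The reshuffle_each_iteration argument of Dataset.shuffle controls whether each pass draws a new order.

```python
import tensorflow as tf


def one_pass(ds):
    """Return the elements of one full iteration as a Python list."""
    return [int(x) for x in ds.as_numpy_iterator()]


numbers = tf.data.Dataset.range(100)

# Seeded shuffle with the tf.data default: a new order is drawn on each pass.
reshuffled = numbers.shuffle(buffer_size=100, seed=1)
pass_a = one_pass(reshuffled)
pass_b = one_pass(reshuffled)
print(pass_a[:10])
print(pass_b[:10])  # typically differs from the first pass

# Same seed, but the order is drawn once and reused on later passes.
fixed = numbers.shuffle(buffer_size=100, seed=1, reshuffle_each_iteration=False)
pass_c = one_pass(fixed)
pass_d = one_pass(fixed)
print("fixed order repeats:", pass_c == pass_d)
```

So a seed makes the shuffling repeatable from one script run to the next, while shuffle=False (or reshuffle_each_iteration=False in your own tf.data pipeline) is what keeps the order identical from one loop to the next.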