Tensorflow image classification. Found 2 million files but only using 416k of them


I am currently working on a basic image classification algorithm in TensorFlow. The code follows the tutorial at https://www.tensorflow.org/tutorials/images/classification almost exactly, except that I am using my own data.

Currently I have the following set up for generating the datasets:

import tensorflow as tf

#Set up information on the data
batch_size = 32
img_height = 100
img_width = 100

#Generate training dataset
train_ds = tf.keras.utils.image_dataset_from_directory(
  Directory,
  validation_split=0.8,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

#Generate val dataset
val_ds = tf.keras.utils.image_dataset_from_directory(
  Directory,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

However, after running on our cluster, I see the following in the terminal output:

2022-09-30 09:49:26.936639: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2022-09-30 09:49:26.956813: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Found 2080581 files belonging to 2 classes.
Using 416117 files for training.
Found 2080581 files belonging to 2 classes.
Using 416116 files for validation.

I don't have much experience with TensorFlow and can't figure out how to fix this error. Can anyone point me in the right direction?

CodePudding user response:

Because you passed validation_split=0.8 in the training call, 80% of your data is reserved for validation and only 20% is used for training (2080581 * 20% ≈ 416117). I think you actually want it the other way around, and both calls should use the same validation_split (together with the same seed) so that the two subsets are complementary:

#Generate training dataset
train_ds = tf.keras.utils.image_dataset_from_directory(
  Directory,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

#Generate val dataset
val_ds = tf.keras.utils.image_dataset_from_directory(
  Directory,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
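
If you want to sanity-check the split after this fix, you can print the number of batches in each subset (a quick sketch; it assumes the train_ds and val_ds defined above):

#Sanity check: batches per subset (multiply by batch_size for an
#approximate image count; the last batch may be smaller)
print(train_ds.cardinality().numpy())
print(val_ds.cardinality().numpy())
print(train_ds.class_names)  #the two class labels found on disk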

Check the docs for further information and this example.
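
The cache warnings in your log are a separate issue: they fire when an iterator stops before reading a cached dataset to the end, as in the `dataset.cache().take(k).repeat()` pattern the message mentions. If you later add caching and prefetching to these datasets, the ordering the warning recommends would look roughly like this (a sketch only; k and the prefetch setting are illustrative, not taken from your code):

AUTOTUNE = tf.data.AUTOTUNE

k = 100  #illustrative: number of batches to keep
#Truncate first, then cache, so the cached dataset is always read in full
train_ds = train_ds.take(k).cache().prefetch(buffer_size=AUTOTUNE)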
