Keras ImageDataGenerator validation_split does not split validation data as expected

I'm trying to learn about computer vision in machine learning with TensorFlow and Keras.

I have a directory that contains 4185 images I got from https://www.kaggle.com/datasets/smaranjitghose/corn-or-maize-leaf-disease-dataset (I intentionally removed 3 images).

I used this code with os.listdir() to confirm the count:

import os
folders = os.listdir('/tmp/datasets/data')
print(f'folders: {folders}')

total_images = 0
for f in folders:
  total_images += len(os.listdir(f'/tmp/datasets/data/{f}'))

print(f'Total Images found: {total_images}')

The following is the output:

folders: ['Blight', 'Common_Rust', 'Gray_Leaf_Spot', 'Healthy']
Total Images found: 4185

I would like to split it into an 80% training set and a 20% validation set with Keras' ImageDataGenerator:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale = 1./255,
    fill_mode='nearest',
    width_shift_range = 0.05,
    height_shift_range = 0.05,
    rotation_range = 45,
    shear_range = 0.1,
    zoom_range=0.2,
    horizontal_flip = True,
    vertical_flip = True,
    validation_split = 0.2,
)

val_datagen = ImageDataGenerator(
    rescale = 1./255,
    validation_split = 0.2
)

train_images = datagen.flow_from_directory('/tmp/datasets/data',
    target_size=(150,150),
    batch_size=32,
    seed=42,
    subset='training',
    class_mode='categorical'
)

val_images = val_datagen.flow_from_directory('/tmp/datasets/data',
    target_size=(150,150), 
    batch_size=32, 
    seed=42,
    subset='validation', 
    class_mode='categorical'
)

The following is the output logged by flow_from_directory():

Found 3350 images belonging to 4 classes.
Found 835 images belonging to 4 classes.

The resulting split is not the expected 3348 | 837 (0.2 * 4185 = 837). Did I miss something, or did I misinterpret the validation_split parameter?

CodePudding user response：

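The data is split per folder (class), not over the entire dataset. The split boundaries are computed separately for each class and truncated with int(), so the per-class validation counts do not have to add up to exactly 20% of the total. See the Keras source code for details. Here is an example of what flow_from_directory is doing internally: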

import os

folders = os.listdir('/content/data')
print(f'folders: {folders}')

total_images = 0
names = []
paths = [] 
white_list_formats = ('png', 'jpg', 'jpeg', 'bmp', 'ppm', 'tif', 'tiff')
for f in folders:
  paths.append(os.listdir(f'/content/data/{f}'))
  for d in os.listdir(f'/content/data/{f}'):
    if d.lower().endswith(white_list_formats):
      names.append(d)

print(f'Total number of valid images found: {len(names)}')

folders: ['Blight', 'Healthy', 'Common_Rust', 'Gray_Leaf_Spot']
Total number of valid images found: 4188

Split data by folders:

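training_samples = 0
for p in paths:
  split = (0.2, 1)  # training subset: from validation_split to the end
  num_files = len(p)
  # int() truncates, so the boundaries are rounded down for each class
  start, stop = int(split[0] * num_files), int(split[1] * num_files)
  valid_files = p[start: stop]
  training_samples += len(valid_files)
print(training_samples)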


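validation_samples = 0
for p in paths:
  split = (0, 0.2)  # validation subset: the first validation_split fraction
  num_files = len(p)
  # int() truncates, so the boundaries are rounded down for each class
  start, stop = int(split[0] * num_files), int(split[1] * num_files)
  valid_files = p[start: stop]
  validation_samples += len(valid_files)
print(validation_samples)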

3352
836

And this corresponds to what you see from flow_from_directory:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale = 1./255,
    fill_mode='nearest',
    width_shift_range = 0.05,
    height_shift_range = 0.05,
    rotation_range = 45,
    shear_range = 0.1,
    zoom_range=0.2,
    horizontal_flip = True,
    vertical_flip = True,
    validation_split = 0.2,
)

val_datagen = ImageDataGenerator(
    rescale = 1./255,
    validation_split = 0.2
)

train_images = datagen.flow_from_directory('/content/data',
    target_size=(150,150),
    batch_size=32,
    seed=42,
    subset='training',
    shuffle=False,
    class_mode='categorical'
)

val_images = val_datagen.flow_from_directory('/content/data',
    target_size=(150,150), 
    batch_size=32, 
    seed=42,
    subset='validation', 
    shuffle=False,
    class_mode='categorical'
)

Found 3352 images belonging to 4 classes.
Found 836 images belonging to 4 classes.

Note that I did not remove the 3 images like you did, but the logic remains the same.
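
To see why your own numbers come out as 3350 | 835 rather than 3348 | 837, here is a minimal sketch of the per-class arithmetic. The per-class counts below are hypothetical: I picked them only so that they sum to 4185, since I don't know which 3 images you removed. The helper name per_class_split is also just for illustration and is not part of Keras.

```python
def per_class_split(class_counts, validation_split):
    """Return (train, val) totals when every class is split on its own
    and the validation size of each class is truncated with int()."""
    train_total = 0
    val_total = 0
    for n in class_counts.values():
        n_val = int(validation_split * n)  # rounded down per class
        val_total += n_val
        train_total += n - n_val           # the rest goes to training
    return train_total, val_total


# Hypothetical per-class counts (assumption): chosen only to sum to 4185.
counts = {
    'Blight': 1144,
    'Common_Rust': 1306,
    'Gray_Leaf_Spot': 574,
    'Healthy': 1161,
}

train, val = per_class_split(counts, 0.2)
total = sum(counts.values())

print(f'per-class split: {train} | {val}')
print(f'global split:    {total - int(0.2 * total)} | {int(0.2 * total)}')
```

With these assumed counts, the validation sizes per class are 228, 261, 114 and 232, which add up to 835, while 20% of the whole dataset would be 837. Each class can lose a fraction of an image to truncation, so the total can fall short of the global figure by up to one image per class.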