Keras preprocessing: number of samples

I have been using the Keras preprocessing function tf.keras.preprocessing.image_dataset_from_directory().

Here are my training and validation datasets:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_path,
    label_mode = 'categorical', # multiclass classification: labels are one-hot encoded
    validation_split = 0.2,     # fraction of the data reserved for validation
    subset = "training",        # this subset is used for training
    seed = 1337,                # fixed seed so the split is reproducible
    image_size = img_size,      # (height, width) that images are resized to
    batch_size = batch_size,    # number of images per batch
)

valid_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_path,
    label_mode ='categorical',
    validation_split = 0.2,
    subset = "validation",      #this subset is used for validation
    seed = 1337,
    image_size = img_size,
    batch_size = batch_size,
)

I wanted to know if there was a way to collect an equal sample size for each class?

Below you can see the number of sample images per class in the target directory:

CodePudding user response:

To recap what's in the comments: the problem is an imbalanced dataset. Training a model on an imbalanced dataset without any countermeasures will tend to produce a model biased toward the over-represented classes.

To tackle this, Keras's Model.fit() has an argument called class_weight. I quote the description given in the documentation:

class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.

To calculate your class weights manually, you can use this formula for each class j:

w_j= total_number_samples / (n_classes * n_samples_j)

Example:

A: 50
B: 100
C: 200

wa = 350/(3*50) = 2.33
wb = 350/(3*100) = 1.17
wc = 350/(3*200) = 0.58
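
The formula above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name balanced_class_weights and the counts dictionary are illustrative and not part of Keras or scikit-learn.

```python
def balanced_class_weights(counts):
    """Return w_j = total_number_samples / (n_classes * n_samples_j) per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Sample counts taken from the example above
counts = {"A": 50, "B": 100, "C": 200}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    print(f"{label}: {w:.2f}")
# prints:
# A: 2.33
# B: 1.17
# C: 0.58
```

The rarest class gets the largest weight, so its samples contribute more to the loss.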

Or you can use scikit-learn:

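# Import NumPy and the scikit-learn helper
import numpy as np
from sklearn.utils import class_weight

# Compute one weight per class; y_train must hold integer class labels
# (for one-hot labels, convert first with np.argmax(y_train, axis=1))
classes = np.unique(y_train)
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=classes,
                                            y=y_train)

# Keras expects a dict mapping class index to weight, not an array
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

# Use the class weights for training
model.fit(X_train, y_train, class_weight=class_weights)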
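
In the question the labels come from the directory layout rather than from a y_train array, so the counts can also be read straight from the class folders. The following is a rough sketch under some assumptions: class_weights_from_directory is a hypothetical helper, every file in a class folder is assumed to be an image, and the weights are based on the whole directory, so they only approximate the class balance of the 80% training subset.

```python
from pathlib import Path


def class_weights_from_directory(root):
    """Map class index -> balanced weight, with one class per subdirectory of root."""
    # Sorting by folder name mirrors the alphanumeric class order
    # that image_dataset_from_directory infers from the subdirectories.
    class_dirs = sorted((p for p in Path(root).iterdir() if p.is_dir()),
                        key=lambda p: p.name)
    # Assumption: every regular file in a class folder is one image sample.
    counts = [sum(1 for f in d.iterdir() if f.is_file()) for d in class_dirs]
    total = sum(counts)
    n_classes = len(counts)
    # Empty class folders are skipped to avoid dividing by zero.
    return {i: total / (n_classes * n) for i, n in enumerate(counts) if n > 0}
```

With the datasets from the question, the result should be usable as model.fit(train_ds, validation_data=valid_ds, class_weight=class_weights_from_directory(train_path)).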