I have been using the Keras preprocessing function tf.keras.preprocessing.image_dataset_from_directory().
Here is how I build my training and validation datasets:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
train_path,
label_mode = 'categorical', # one-hot encoded labels, used for multiclass classification
validation_split = 0.2, # fraction of the dataset reserved for validation
subset = "training", # this subset is used for training
seed = 1337, # fixed seed so the split is reproducible
image_size = img_size, # size the input images are resized to
batch_size = batch_size, # number of images per batch
)
valid_ds = tf.keras.preprocessing.image_dataset_from_directory(
train_path,
label_mode ='categorical',
validation_split = 0.2,
subset = "validation", #this subset is used for validation
seed = 1337,
image_size = img_size,
batch_size = batch_size,
)
Is there a way to draw an equal number of samples from each class?
Below you can see the number of sample images per class in the target directory:
CodePudding user response:
To recap what's in the comments: the problem is an imbalanced dataset. Training a model on an imbalanced dataset without any countermeasures will tend to produce a biased model.
To tackle this, the Keras Model.fit() method has an argument called class_weight. I quote the description given in the documentation:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
To calculate the class weights manually, use this formula for each class j:
w_j= total_number_samples / (n_classes * n_samples_j)
Example:
A: 50
B: 100
C: 200
wa = 350/(3*50) = 2.33
wb = 350/(3*100) = 1.17
wc = 350/(3*200) = 0.58
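The same arithmetic can be sketched in a few lines of Python. The helper name balanced_class_weights and the counts dictionary are made up for illustration; only the formula comes from the answer above.

```python
def balanced_class_weights(counts):
    """Return {class: weight} using w_j = total / (n_classes * n_j).

    `counts` maps each class to its number of samples (hypothetical input).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Sample counts from the example above
counts = {"A": 50, "B": 100, "C": 200}
weights = balanced_class_weights(counts)

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

Note that every class ends up contributing the same total weight (n_j * w_j = total / n_classes), which is what "balanced" means here.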
Or you can use scikit-learn:
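# Import NumPy and the scikit-learn helper
import numpy as np
from sklearn.utils import class_weight
# Get class weights; y_train must hold integer class labels (0..n_classes-1), not one-hot vectors
classes = np.unique(y_train)
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=classes,
                                            y=y_train)
# Keras expects a dict mapping class index to weight, not an array
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
# Use the class weights for training
model.fit(X_train, y_train, class_weight=class_weights)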
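Since the question builds its datasets from a directory rather than from X_train / y_train arrays, the counts can also be read off the folder layout. The following is only a sketch under several assumptions: train_path contains one subfolder per class, every file in a subfolder is an image, and class indices follow the alphanumeric order of the folder names (which is how image_dataset_from_directory infers labels, as far as I know). The function name class_weights_from_directory is hypothetical.

```python
from pathlib import Path


def class_weights_from_directory(root):
    """Build {class_index: weight} from a one-folder-per-class layout.

    Assumes class indices follow the sorted order of the subfolder names
    and that every file inside a subfolder is a sample of that class.
    """
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    counts = [sum(1 for f in d.iterdir() if f.is_file()) for d in class_dirs]
    total = sum(counts)
    n_classes = len(counts)
    # Skip empty folders to avoid dividing by zero
    return {i: total / (n_classes * n) for i, n in enumerate(counts) if n > 0}


# Hypothetical usage with the datasets from the question:
# class_weights = class_weights_from_directory(train_path)
# model.fit(train_ds, validation_data=valid_ds, class_weight=class_weights)
```

These counts cover the whole directory, so with validation_split = 0.2 they only approximate the class distribution of the training subset.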