Let's say we have two classes: one is small and the other is large.
I would like to use data augmentation (similar to ImageDataGenerator) for the small class, and sampling for the large class, so that each batch ends up balanced (for the minority class: augmentation; for the majority class: sampling).
Also, I would like to continue using image_dataset_from_directory
(since the dataset doesn't fit into RAM).
CodePudding user response:
You can use tf.data.Dataset.from_generator, which gives you more control over how your data is generated without loading all of it into RAM.
import tensorflow as tf

def generator():
    i = 0
    while True:
        if i % 2 == 0:
            elem = large_class_sample()      # draw a sample from the large class
        else:
            elem = small_class_augmented()   # draw an augmented sample from the small class
        yield elem
        i += 1

ds = tf.data.Dataset.from_generator(
    generator,
    output_signature=tf.TensorSpec(shape=sample_shape, dtype=sample_dtype))  # shape/dtype of one yielded element
This generator alternates samples between the two classes, and you can chain further dataset operations onto it (batch, shuffle, ...).
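For example, a minimal sketch of chaining those operations onto ds (the batch size and shuffle buffer below are just illustrative values, not from the question):

balanced_ds = (ds
               .shuffle(256)                   # small shuffle buffer; tune as needed
               .batch(32)                      # each batch alternates the two classes 1:1
               .prefetch(tf.data.AUTOTUNE))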
CodePudding user response:
I didn't totally follow the problem. Would pseudo-code like this work? Perhaps there are operators on tf.data.Dataset that are sufficient to solve your problem.
ds = image_dataset_from_directory(...)

ds1 = ds.filter(lambda image, label: label == MAJORITY)            # majority-class stream
ds2 = ds.filter(lambda image, label: label != MAJORITY)            # minority-class stream
ds2 = ds2.map(lambda image, label: (data_augment(image), label))   # augment only the minority class

ds1 = ds1.batch(int(10. / MAJORITY_RATIO))
ds2 = ds2.batch(int(10. / MINORITY_RATIO))

ds3 = tf.data.Dataset.zip((ds1, ds2))
ds3 = ds3.map(lambda left, right: (tf.concat([left[0], right[0]], axis=0),
                                   tf.concat([left[1], right[1]], axis=0)))
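A few things the pseudo-code above leaves open: image_dataset_from_directory returns pre-batched data, so you would .unbatch() before filtering, and the minority stream should .repeat() so the zip does not end as soon as the small class is exhausted. A minimal sketch under those assumptions (the path, MAJORITY, PER_CLASS, and data_augment below are placeholders, not values from the question):

import tensorflow as tf
from tensorflow.keras.utils import image_dataset_from_directory

MAJORITY = 0        # hypothetical: label index of the large class
PER_CLASS = 16      # hypothetical: samples per class in each combined batch

def data_augment(image):
    # Placeholder augmentation; swap in your own ops.
    image = tf.image.random_flip_left_right(image)
    return tf.image.random_brightness(image, max_delta=0.1)

ds = image_dataset_from_directory("path/to/images", label_mode="int").unbatch()

ds1 = ds.filter(lambda image, label: label == MAJORITY).batch(PER_CLASS)
ds2 = (ds.filter(lambda image, label: label != MAJORITY)
         .map(lambda image, label: (data_augment(image), label))
         .repeat()                   # avoid exhausting the small class
         .batch(PER_CLASS))

ds3 = tf.data.Dataset.zip((ds1, ds2)).map(
    lambda left, right: (tf.concat([left[0], right[0]], axis=0),
                         tf.concat([left[1], right[1]], axis=0)))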