How to find how many images belong to which class in keras

Time:01-15

I am working on a sample project on melanoma (the dataset from Kaggle). Using tf.keras.utils.image_dataset_from_directory I got train_ds, but I would like to print how many images belong to each class. For example: actinic keratosis: x images, basal cell carcinoma: y images.

The code I used to load the data is

import pathlib
import tensorflow as tf

data_dir = pathlib.Path("Train\\")
batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

This prints the following output:

Found 6739 files belonging to 9 classes.
Using 5392 files for training.

and this:

class_names = train_ds.class_names
print(class_names)

gives me the names of the classes.

How do I see how many images belong to each class? One way I found is to count the files under each class directory using the code below (taken from GitHub):

import pathlib
import pandas as pd

def class_distribution_count(directory):
    # Build (class name, file count) pairs in one pass so names and counts stay aligned.
    rows = []
    for path in sorted(pathlib.Path(directory).iterdir()):
        if path.is_dir():
            n_files = sum(1 for entry in path.iterdir() if entry.is_file())
            rows.append((path.name, n_files))
    return pd.DataFrame(rows, columns=['Class', 'No. of Image'])

df = class_distribution_count(data_dir)
df

But I would like to know whether there is a way to get this directly from the Keras dataset, without having to read the files in the directory.

Thanks in advance.

I also tried this

import pandas as pd
dataset_unbatched = tuple(train_ds.unbatch())
labels = []
for (image,label) in dataset_unbatched:
    labels.append(label.numpy())
labels = pd.Series(labels)
count = labels.value_counts()
print(count)

But this gives me counts indexed by the integer label values, not by the class names.

CodePudding user response:

Just replace the index of count with the train_ds.class_names like this:

import pandas as pd
dataset_unbatched = tuple(train_ds.unbatch())
labels = []
for (image,label) in dataset_unbatched:
    labels.append(label.numpy())
labels = pd.Series(labels)

# adjustments: sort by integer label so the order matches train_ds.class_names
count = labels.value_counts().sort_index()
count.index = train_ds.class_names

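Make sure to sort the index first: value_counts() orders its result by frequency (or by first occurrence with sort=False), not by label value, so without sort_index() the class names would be attached to the wrong counts. This also assumes every class appears at least once in train_ds; otherwise the lengths will not match and the index assignment fails.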
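
As a lighter-weight alternative to materializing every image with tuple(train_ds.unbatch()), the counts can be accumulated batch by batch, or derived from the file paths without decoding any images. The following is only a minimal sketch, not the method from the answer above: the helper names counts_from_label_batches and counts_from_file_paths are hypothetical, it assumes the default integer labels (label_mode='int'), and it assumes that the dataset exposes a file_paths attribute and that each image sits directly inside its class folder.

```python
from collections import Counter
from pathlib import PurePath


def counts_from_label_batches(label_batches, class_names):
    """Count images per class from an iterable of integer-label batches."""
    counts = Counter()
    for batch in label_batches:
        # int() normalizes NumPy integer scalars to plain Python ints.
        counts.update(int(label) for label in batch)
    # Report every class, including any with zero images in this subset.
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}


def counts_from_file_paths(file_paths, class_names):
    """Count images per class, taking each file's parent folder as its class."""
    counts = Counter(PurePath(p).parent.name for p in file_paths)
    return {name: counts.get(name, 0) for name in class_names}


# Hypothetical usage with the train_ds from the question (not run here):
#   names = train_ds.class_names
#   by_labels = counts_from_label_batches((y.numpy() for _, y in train_ds), names)
#   by_paths = counts_from_file_paths(train_ds.file_paths, names)  # assumes file_paths exists
```

Both helpers return a plain dict keyed by class name, so a class that is missing from the training subset shows up with a count of 0 instead of causing a length mismatch.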
