I am working on a sample project on melanoma (the dataset from Kaggle). Using tf.keras.utils.image_dataset_from_directory I got train_ds, but I would like to print out how many images belong to each class. Example: actinic keratosis: x images, basal cell carcinoma: y images
The code I used to load the data is
import pathlib
import tensorflow as tf

data_dir = pathlib.Path("Train\\")
batch_size = 32
img_height = 180
img_width = 180
# Load 80% of the images as the training subset
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
This gives me this output:
Found 6739 files belonging to 9 classes.
Using 5392 files for training.
And this:
class_names = train_ds.class_names
print(class_names)
gives me the names of the classes.
How do I see how many images belong to each class? One way I found is to count the files under each subdirectory using the code below (taken from GitHub):
import os
import pathlib
import pandas as pd

def class_distribution_count(directory):
    # Count the files inside each class subdirectory
    count = []
    for path in pathlib.Path(directory).iterdir():
        if path.is_dir():
            count.append(len([name for name in os.listdir(path)
                              if os.path.isfile(os.path.join(path, name))]))
    # Names of the class subdirectories
    sub_directory = [name for name in os.listdir(directory)
                     if os.path.isdir(os.path.join(directory, name))]
    return pd.DataFrame(list(zip(sub_directory, count)), columns=['Class', 'No. of Image'])

df = class_distribution_count(data_dir)
df
But I would like to know whether there is a way to get this directly from the Keras dataset, without having to read the files in the directory.
Thanks in advance.
I also tried this
import pandas as pd
dataset_unbatched = tuple(train_ds.unbatch())
labels = []
for (image, label) in dataset_unbatched:
    labels.append(label.numpy())
labels = pd.Series(labels)
count = labels.value_counts()
print(count)
But this gives me counts indexed by the integer label values, not by the class names.
Answer:
Just replace the index of count with train_ds.class_names, like this:
import pandas as pd
dataset_unbatched = tuple(train_ds.unbatch())
labels = []
for (image, label) in dataset_unbatched:
    labels.append(label.numpy())
labels = pd.Series(labels)
# adjustments: sort by integer label, then swap in the class names
count = labels.value_counts().sort_index()
count.index = train_ds.class_names
print(count)
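Make sure to sort the index beforehand: value_counts() orders its result by frequency by default, so without sort_index() the counts would not line up with train_ds.class_names. Also note that assigning the index only works if every class appears at least once in the training subset; otherwise the lengths will not match.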
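As a possible variation on the above, here is a minimal sketch that counts labels batch by batch instead of collecting every image into a tuple first, and that reports classes with zero samples. The helper names count_labels and count_from_paths are hypothetical, as is the demo data. The second helper assumes the dataset exposes a file_paths attribute with paths laid out as <class folder>/<file>, which image_dataset_from_directory appears to set in current Keras versions; treat that as an assumption and check it on your version.

```python
from collections import Counter
from pathlib import Path


def count_labels(batches, class_names):
    """Count samples per class from an iterable of (images, labels) batches.

    Assumes integer labels (the default label_mode="int").
    """
    counts = Counter()
    for _images, labels in batches:
        counts.update(int(label) for label in labels)
    # Report every class, including ones with no samples in this subset.
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}


def count_from_paths(file_paths):
    """Count samples per class from paths shaped like <class folder>/<file>."""
    return dict(Counter(Path(p).parent.name for p in file_paths))


# Stand-in data (hypothetical) so the sketch runs without TensorFlow.
demo_names = ["actinic keratosis", "basal cell carcinoma", "melanoma"]
demo_batches = [(None, [0, 1, 1]), (None, [1, 0])]
demo_counts = count_labels(demo_batches, demo_names)
print(demo_counts)

demo_paths = ["Train/melanoma/a.jpg", "Train/melanoma/b.jpg", "Train/nevus/c.jpg"]
demo_path_counts = count_from_paths(demo_paths)
print(demo_path_counts)
```

With the real dataset, the calls would look like count_labels(train_ds.as_numpy_iterator(), train_ds.class_names) and, if the attribute is available, count_from_paths(train_ds.file_paths). The first still decodes every image once while iterating; the second only looks at the path strings.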