I have a folder called train_ds (don't get confused by the name, it is just a folder with pictures) that contains 5 subfolders of pictures. Each subfolder is a different class.
I'm running 5 different trained models over this train_ds folder to get their inferences. What I want is to explicitly find the pictures that all models fail to classify correctly. For that:
- Use the tf method image_dataset_from_directory to load pics.
- Use the function inferences_target_list to get a list of inferred labels and a list of the real labels. Both lists have the same length.
- Use the function get_missclassified to get a list of the indexes where the inference differs from the real label. Voilà, I have the mismatched ones for one model.
- Run the same for the 5 trained models.
- Get the common indexes across the 5 different runs (sketched below).
So I could say: I have indexed all images in the train_ds folder and, out of all of them, I know which indexes were classified wrong by every model.
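As a rough sketch of what I mean by the common indexes (the index lists here are made up, just for illustration):
# hypothetical example: one list of misclassified indexes per model;
# keep only the indexes that appear in every list
per_model_misclassified = [
    [3, 7, 12, 40],   # model 1
    [3, 12, 55],      # model 2
    [1, 3, 12, 40],   # model 3
    [3, 9, 12],       # model 4
    [3, 12, 40, 77],  # model 5
]
common = set(per_model_misclassified[0]).intersection(*per_model_misclassified[1:])
print(sorted(common))  # -> [3, 12]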
The question now is... How do I get the pictures associated with those indexes from the image_dataset_from_directory method?
Functions:
import numpy as np
import tensorflow as tf

def inferences_target_list(model, data):
    '''
    returns 2 lists: inferences list, real labels
    '''
    # predictions over the train set (fold 1)
    y_pred_float = model.predict(data)
    y_pred = np.argmax(y_pred_float, axis=1)
    # get real labels (shuffle=False keeps the order stable)
    y_target = tf.concat([y for x, y in data], axis=0)
    print("length inferences and real labels: ", len(y_pred), len(y_target))
    return y_pred, y_target

def get_missclassified(y_pred, y_target):
    '''
    returns a list with the indexes of real labels that were misclassified
    '''
    missclassified = []
    for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
        if pred != target:
            missclassified.append(i)
    print("total missclassified: ", len(missclassified))
    return missclassified
Method:
missclassified_train_folders = []
for f in folders:  # at the moment just 1 folder
    print(f)
    for nn in models_dict:  # dictionary of trained models
        print(nn)
        # -- train dataset for each folder
        train_path = reg_input + f + "/" + "train_ds/"
        # print("\n train dataset:", "\n", train_path)
        train_ds = image_dataset_from_directory(
            train_path,
            class_names=["Bedroom", "Bathroom", "Dinning", "Livingroom", "Kitchen"],
            seed=None,
            validation_split=None,
            subset=None,
            image_size=image_size,
            batch_size=batch_size,
            color_mode='rgb',
            shuffle=False
        )
        # inferences and real values
        y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
        # missclassified ones
        missclassified = get_missclassified(y_pred, y_target)
        print("elements missclassified in {} for model {}: ".format(f, nn), len(missclassified))
        missclassified_train_folders.append(missclassified)
I got the list of indexes, but I don't know how to map it back to the actual pictures.
Thanks in advance! | (• ◡•)| (❍ᴥ❍ʋ)
CodePudding user response:
image_dataset_from_directory uses the index_directory function behind the scenes to index the directories. Basically, it sorts the subdirectories using Python's sorted and loops through them with a ThreadPool.
You can directly import index_directory and use it to return the file paths, the labels and, of course, the index.
Check it out at: https://github.com/keras-team/keras/blob/d8fcb9d4d4dad45080ecfdd575483653028f8eda/keras/preprocessing/dataset_utils.py#L26
You can use something like this to get the indexed format of the dataset:
from keras.preprocessing.dataset_utils import index_directory

ALLOWLIST_FORMATS = ('.bmp', '.gif', '.jpeg', '.jpg', '.png')

file_paths, labels, class_names = index_directory(directory="/path/to/train_ds",
                                                  labels="inferred",
                                                  formats=ALLOWLIST_FORMATS,
                                                  shuffle=False)
Also, keep shuffle set to False (as in your image_dataset_from_directory call) so the indexes line up.
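From there, a minimal sketch of how this answers the original question (assuming misclassified is the index list returned by the question's get_missclassified function, and that both calls enumerated the files in the same order because shuffle=False):
# map the misclassified indexes back to the actual image paths
misclassified_paths = [file_paths[i] for i in misclassified]
for p in misclassified_paths:
    print(p)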
CodePudding user response:
The answer given by @ma7555 was the simple solution I was looking for; nevertheless, the labels list output by the @ma7555 method is different from the one obtained using tf.concat([y for x, y in train_ds], axis=0).
train_ds is created using the image_dataset_from_directory method and has 5 subfolders inside (my classes). The clumsy solution I have at the moment is:
- get the list of inferred labels and the real ones with inferences_target_list
- compare the 2 lists, check which labels are different and store their indexes with get_misclassified
- get the list of files in the folders with get_list_of_files; this should be the same as the paths from @ma7555's index_directory, but I didn't check yet whether the order is the same (see the sanity check after the get_list_of_files definition below)
import os
import numpy as np
import tensorflow as tf

def inferences_target_list(model, data):
    '''
    returns 2 lists: inferences list, real labels
    '''
    # predictions over the train set (fold 1)
    y_pred_float = model.predict(data)
    y_pred = np.argmax(y_pred_float, axis=1)
    # get real labels
    y_target = tf.concat([y for x, y in data], axis=0)
    print("length inferences and real labels: ", len(y_pred), len(y_target))
    return y_pred, y_target

def get_misclassified(y_pred, y_target):
    '''
    returns a list with the indexes of real labels that were misclassified
    '''
    misclassified = []
    for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
        if pred != target:
            misclassified.append(i)
    print("total misclassified: ", len(misclassified))
    return misclassified
def get_list_of_files(dirName):
    '''
    create a list of the files in the given directory and its sub directories, recursively
    found here => https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/
    '''
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + get_list_of_files(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles
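One caveat I still need to verify (an assumption on my side): os.listdir returns entries in arbitrary order, while index_directory / image_dataset_from_directory enumerate sorted class folders and sorted file names, so it is worth checking that the recursive listing is already in that order before pairing it with the dataset indexes:
# hypothetical sanity check, not part of the pipeline below:
# if this prints False, pic_list should be sorted before indexing into it
pic_list_check = get_list_of_files(train_path)
print("already in sorted order:", pic_list_check == sorted(pic_list_check))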
Start
misclassified_train_folders = []
for f in folders:
    print(f)
    for nn in models_dict:
        # print(nn)
        # -- train dataset for each folder
        train_path = reg_input + f + "/" + "train_ds/"
        # print("\n train dataset:", "\n", train_path)
        train_ds = image_dataset_from_directory(
            train_path,
            class_names=["Bedroom", "Bathroom", "Dinning", "Livingroom", "Kitchen"],
            seed=None,
            validation_split=None,
            subset=None,
            image_size=image_size,
            batch_size=batch_size,
            color_mode='rgb',
            shuffle=False
        )
        # list of paths for the analysed images
        pic_list = get_list_of_files(train_path)
        # inferences and real values
        y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
        # misclassified ones
        misclassified = get_misclassified(y_pred, y_target)
        print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified))
        misclassified_train_folders.append(misclassified)
- Now I have a list with 5 lists inside: each inner list contains the elements misclassified by one of the models on my first folder. To get the pictures that are always misclassified:
common_misclassified = list(set.intersection(*map(set, misclassified_train_folders)))
# these are the indexes of those images
print(len(common_misclassified), "\n", common_misclassified)
- To get the paths of those pics:
pic_list_misclassified = [pic_list[i] for i in common_misclassified]
# paths of the elements misclassified by all models
print(len(pic_list_misclassified))
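With those paths, the pictures themselves can finally be inspected; a minimal sketch (assuming Pillow is installed, and using a hypothetical output folder name):
import shutil
from PIL import Image

# copy the always-misclassified pictures into a review folder
review_dir = "misclassified_review"  # hypothetical folder name
os.makedirs(review_dir, exist_ok=True)
for path in pic_list_misclassified:
    shutil.copy(path, review_dir)

# or open one of them directly for a quick look
Image.open(pic_list_misclassified[0]).show()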