tensorflow image_dataset_from_directory get certain pictures of the method by index list

Time:03-22

I have a folder called train_ds (don't get confused by the name, it's just a folder with pictures) containing 5 subfolders of pictures. Each subfolder is a different class.

I'm running 5 different trained models over this train_ds folder to get inferences. What I want is to find out explicitly on which pictures all models fail to infer correctly. For that I:

  • Use the tf method image_dataset_from_directory to load the pictures.
  • Use the function inferences_target_list to get a list of inferred labels and a list of the real labels. Both lists have the same length.
  • Use the function get_misclassified to get a list of the indexes where the inference differs from the real value. Voilà, I have the mismatched ones for one model.
  • Run the same process for the 5 trained models.
  • Get the common indexes across the 5 different runs.

So I could say I have indexed all images in the train_ds folder, and across all of them I know which indexes correspond to an image classified wrong by every model.
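For reference, the last step (intersecting the per-model index lists) can be sketched with plain Python sets; the index lists below are made up for illustration:

```python
# Hypothetical per-model lists of misclassified indexes (one list per model)
misclassified_per_model = [
    [2, 5, 7, 11],   # model 1
    [2, 7, 9, 11],   # model 2
    [1, 2, 7, 11],   # model 3
    [2, 3, 7, 11],   # model 4
    [2, 6, 7, 11],   # model 5
]

# Indexes misclassified by every model: intersect the sets
common = sorted(set.intersection(*map(set, misclassified_per_model)))
print(common)  # [2, 7, 11]
```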

The question now is... how do I get the pictures associated with those indexes from the image_dataset_from_directory method?

Functions:

import numpy as np
import tensorflow as tf


def inferences_target_list(model, data):
    '''
    returns 2 lists: inferences list, real labels
    '''
    # over train set fold1
    y_pred_float = model.predict(data)
    y_pred = np.argmax(y_pred_float, axis=1)

    # get real labels
    y_target = tf.concat([y for x, y in data], axis=0)
    print("length inferences and real labels: ", len(y_pred), len(y_target))
    return y_pred, y_target


def get_misclassified(y_pred, y_target):
  '''
  returns a list with the indexes of real labels that were misclassified
  '''
  misclassified = []
  for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
    if pred != target:
      misclassified.append(i)
  print("total misclassified: ", len(misclassified))
  return misclassified
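As a side note, the same indexes can be obtained without an explicit Python loop via NumPy (a sketch, assuming both inputs are 1-D arrays of equal length):

```python
import numpy as np

# Dummy predictions and labels for illustration
y_pred = np.array([0, 1, 2, 2, 1])
y_target = np.array([0, 2, 2, 0, 1])

# Indexes where prediction and real label differ
misclassified = np.where(y_pred != y_target)[0].tolist()
print(misclassified)  # [1, 3]
```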

Method:

misclassified_train_folders=[]

for f in folders: # at the moment just 1 folder 
  print(f)
  for nn in models_dict: # dictionary of trained models
    print(nn)

    # -- train dataset for each folder
    train_path = reg_input + f + "/" + 'train_ds/'
    # print("\n train dataset:", "\n", train_path)
    train_ds = image_dataset_from_directory(
        train_path,
        class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
        seed=None,
        validation_split=None, 
        subset=None,
        image_size= image_size,
        batch_size= batch_size,
        color_mode='rgb',
        shuffle=False 
        )
    
    # inferences and real values
    y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
    
    # misclassified ones
    misclassified = get_misclassified(y_pred, y_target)
    print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified))
    misclassified_train_folders.append(misclassified)

I got the list of indexes, but I don't know how to apply it.

Thanks in advance! | (• ◡•)| (❍ᴥ❍ʋ)

CodePudding user response:

image_dataset_from_directory uses the index_directory function behind the scenes to index the directories. Basically, it sorts the subdirectories using Python's sorted and loops through them with a ThreadPool.

You can directly import it and use it to return the file paths, labels and, of course, the indexes.

Check it out at: https://github.com/keras-team/keras/blob/d8fcb9d4d4dad45080ecfdd575483653028f8eda/keras/preprocessing/dataset_utils.py#L26

You can use something like this to get the indexed format of the dataset

from keras.preprocessing.dataset_utils import index_directory

ALLOWLIST_FORMATS = ('.bmp', '.gif', '.jpeg', '.jpg', '.png')
file_paths, labels, class_names = index_directory(directory="/path/to/train_ds", labels="inferred", formats=ALLOWLIST_FORMATS)

Also, keep shuffle set to False.
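With shuffle=False, the dataset order should match the order of file_paths returned by index_directory, so the misclassified indexes can be used directly on that list. A sketch with made-up paths and indexes:

```python
# Hypothetical output of index_directory (shuffle=False keeps this order)
file_paths = [
    "train_ds/Bathroom/img_001.jpg",
    "train_ds/Bedroom/img_002.jpg",
    "train_ds/Kitchen/img_003.jpg",
    "train_ds/Livingroom/img_004.jpg",
]

misclassified = [1, 3]  # indexes of the misclassified pictures

# Look up the offending pictures by index
bad_pics = [file_paths[i] for i in misclassified]
print(bad_pics)
```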

CodePudding user response:

The answer given by @ma7555 was the simple solution I was looking for; nevertheless, the labels list output by the ma7555 method is different from the one obtained using tf.concat([y for x, y in train_ds], axis=0).

train_ds is created using the image_dataset_from_directory method and has 5 subfolders inside (my classes). The clumsy solution I have at the moment is:

  • get the list of inferred labels and the real ones with inferences_target_list
  • compare the 2 lists, check which labels are different, and store their indexes with get_misclassified
  • get the list of files in the folders with get_list_of_files. This should be the same as paths from ma7555's answer; I didn't check yet whether the order is the same.
def inferences_target_list(model, data):
    '''
    returns 2 lists: inferences list, real labels
    '''
    # over train set fold1
    y_pred_float = model.predict(data)
    y_pred = np.argmax(y_pred_float, axis=1)

    # get real labels
    y_target = tf.concat([y for x, y in data], axis=0)
    print("length inferences and real labels: ", len(y_pred), len(y_target))
    return y_pred, y_target


def get_misclassified(y_pred, y_target):
  '''
  returns a list with the indexes of real labels that were misclassified
  '''
  misclassified = []
  for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
    if pred != target:
      misclassified.append(i)
  print("total misclassified: ", len(misclassified))
  return misclassified

import os


def get_list_of_files(dirName):
    '''
    create a list of file and sub directories names in the given directory
    found here => https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/
    ''' 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + get_list_of_files(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles
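One caveat: os.listdir returns entries in arbitrary, platform-dependent order, so the listing may not match the sorted order index_directory uses. A safer sketch uses pathlib with an explicit sorted(); the ordering should still be verified against index_directory's output (demo on a throwaway directory):

```python
import pathlib
import tempfile

def get_sorted_files(dir_name):
    """Recursively list files, sorted, to mirror keras' sorted indexing."""
    root = pathlib.Path(dir_name)
    return sorted(str(p) for p in root.rglob("*") if p.is_file())

# Demo on a temporary directory tree with two class subfolders
with tempfile.TemporaryDirectory() as tmp:
    for sub, name in [("Bedroom", "b.jpg"), ("Kitchen", "a.jpg")]:
        d = pathlib.Path(tmp) / sub
        d.mkdir()
        (d / name).touch()
    files = get_sorted_files(tmp)
    names = [pathlib.Path(f).name for f in files]
    print(names)  # ['b.jpg', 'a.jpg'] -- sorted by full path, so Bedroom/ comes first
```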

Start:

misclassified_train_folders=[]

for f in folders:
  print(f)
  for nn in models_dict:
    #print(nn)

    # -- train dataset for each folder
    train_path = reg_input + f + "/" + 'train_ds/'
    # print("\n train dataset:", "\n", train_path)
    train_ds = image_dataset_from_directory(
        train_path,
        class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
        seed=None,
        validation_split=None, 
        subset=None,
        image_size= image_size,
        batch_size= batch_size,
        color_mode='rgb',
        shuffle=False 
        )
    
    # list of paths for analysed images
    pic_list = get_list_of_files(train_path)
    
    # inferences and real values
    y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
    
    # misclassified ones
    misclassified = get_misclassified(y_pred, y_target)
    print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified))
    misclassified_train_folders.append(misclassified)

  • Now I have a list with 5 lists inside: those lists contain all the elements misclassified by each model in my first folder. Getting the pictures that are always misclassified:
common_misclassified = list(set.intersection(*map(set, misclassified_train_folders)))
# this are the indexes of that images
print(len(common_misclassified), "\n", common_misclassified)
  • to get the paths of those pics:
pic_list_misclassified = [pic_list[i] for i in common_misclassified]

# paths of the elements misclassified by all models
print(len(pic_list_misclassified))
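To actually inspect those pictures, one option is to copy them into a review folder with shutil (a sketch; the "pictures" here are dummy files created on the fly):

```python
import pathlib
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "train_ds"
    src.mkdir()
    # Stand-ins for the real misclassified picture paths
    pic_list_misclassified = []
    for name in ["img_007.jpg", "img_042.jpg"]:
        p = src / name
        p.touch()
        pic_list_misclassified.append(str(p))

    # Copy every commonly misclassified picture into a review folder
    review = pathlib.Path(tmp) / "misclassified_review"
    review.mkdir()
    for pic in pic_list_misclassified:
        shutil.copy(pic, review)

    copied = sorted(p.name for p in review.iterdir())
    print(copied)  # ['img_007.jpg', 'img_042.jpg']
```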