How to filter for rows with close values across columns-CodePudding

I have columns of probabilities in a pandas dataframe as an output from multiclass machine learning.

I am looking to filter rows for which the model had very close probabilities between the classes for that row, and ideally only care about similar values that are similar to the highest value in that row, but I'm not sure where to start.

For example my data looks like this:

ID    class1  class2  class3  class4  class5
row1   0.97     0.2     0.4     0.3    0.2  
row2   0.97     0.96    0.4     0.3    0.2  
row3   0.7      0.5     0.3     0.4    0.5  
row4   0.97     0.98    0.99    0.3   0.2
row5   0.1      0.2     0.3     0.78   0.8  
row6   0.1     0.11     0.3     0.9    0.2

I'd like to filter for rows where at least 2 (or more) probability class columns have a probability that is close to at least one other probability column in that row (e.g., maybe within 0.05). So an example output would filter to:

ID    class1  class2  class3  class4  class5
row2   0.97     0.96    0.4     0.3    0.2  
row4   0.97     0.98    0.99    0.3    0.2  
row5   0.1      0.2     0.3     0.78   0.8

I don't mind if a filter includes row6 as it also meets my <0.05 different main requirement, but ideally because the 0.05 difference isn't with the largest probability I'd prefer to ignore this too.

What can I do to develop a filter like this?

Example data:

Edit: I have increased the size of my example data, as I do not want pairs specifically but any and all rows that in inside their row their column values for 2 or more probabilities have close values


d = {'ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6'], 
'class1': [0.97, 0.97, 0.7, 0.97, 0.1, 0.1], 
'class2': [0.2, 0.96, 0.5, 0.98, 0.2, 0.11], 
'class3': [0.4, 0.4, 0.3, 0.2, 0.3, 0.3], 
'class4': [0.3, 0.3, 0.4, 0.3, 0.78, 0.9], 
'class5': [0.2, 0.2, 0.5, 0.2, 0.8, 0.2]}

df = pd.DataFrame(data=d)

CodePudding user response：

Here is an example using numpy and itertools.combinations to get the pairs of similar rows with at least N matches with 0.05:

from itertools import combinations
import numpy as np

df2 = df.set_index('ID')

N = 2

out = [(a, b) for a,b in combinations(df2.index, r=2)
       if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum()>=N]

Output:

[('row1', 'row2'), ('row1', 'row4'), ('row2', 'row4')]

follow-up

My real data is 10,000 rows and I want to filter out all rows that have more than one column of probabilities that are close to each other. Is there a way to do this without specifying pairs

from itertools import combinations

N = 2

df2 = df.set_index('ID')

keep = set()
seen = set()

for a,b in combinations(df2.index, r=2):
    if {a,b}.issubset(seen):
        continue
    if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum()>=N:
        keep.update({a, b})
    seen.update({a, b})
    
print(keep)
# {'row1', 'row2', 'row4'}

CodePudding user response：

You can do that with:

Transpose the dataframe to get each sample as column and classes probabilities as rows.

We can check the minimal requirement if the difference between the 2 largest values is less than or equal 0.05.

df = pd.DataFrame(data=d).set_index("ID").T

result = [col for col in df.columns if np.isclose(*df[col].nlargest(2), atol=0.05)]

Output:

['row2', 'row4', 'row5']'

Dataframe after the transpose:

 ID     row1    row2    row3  row4  row5    row6
class1  0.97    0.97    0.7   0.97  0.10    0.10
class2  0.20    0.96    0.5   0.98  0.20    0.11
class3  0.40    0.40    0.3   0.20  0.30    0.30
class4  0.30    0.30    0.4   0.30  0.75    0.90
class5  0.20    0.20    0.5   0.20  0.80    0.20