I have columns of probabilities in a pandas dataframe as an output from multiclass machine learning.
I am looking to filter rows for which the model had very close probabilities between the classes for that row, and ideally only care about similar values that are similar to the highest value in that row, but I'm not sure where to start.
For example my data looks like this:
ID class1 class2 class3 class4 class5
row1 0.97 0.2 0.4 0.3 0.2
row2 0.97 0.96 0.4 0.3 0.2
row3 0.7 0.5 0.3 0.4 0.5
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
row6 0.1 0.11 0.3 0.9 0.2
I'd like to filter for rows where at least 2 (or more) probability class columns have a probability that is close to at least one other probability column in that row (e.g., maybe within 0.05). So an example output would filter to:
ID class1 class2 class3 class4 class5
row2 0.97 0.96 0.4 0.3 0.2
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
I don't mind if a filter includes row6
as it also meets my <0.05 different main requirement, but ideally because the 0.05 difference isn't with the largest probability I'd prefer to ignore this too.
What can I do to develop a filter like this?
Example data:
Edit: I have increased the size of my example data, as I do not want pairs specifically but any and all rows that in inside their row their column values for 2 or more probabilities have close values
d = {'ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
'class1': [0.97, 0.97, 0.7, 0.97, 0.1, 0.1],
'class2': [0.2, 0.96, 0.5, 0.98, 0.2, 0.11],
'class3': [0.4, 0.4, 0.3, 0.2, 0.3, 0.3],
'class4': [0.3, 0.3, 0.4, 0.3, 0.78, 0.9],
'class5': [0.2, 0.2, 0.5, 0.2, 0.8, 0.2]}
df = pd.DataFrame(data=d)
CodePudding user response:
Here is an example using numpy
and itertools.combinations
to get the pairs of similar rows with at least N
matches with 0.05:
from itertools import combinations
import numpy as np
df2 = df.set_index('ID')
N = 2
out = [(a, b) for a,b in combinations(df2.index, r=2)
if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum()>=N]
Output:
[('row1', 'row2'), ('row1', 'row4'), ('row2', 'row4')]
follow-up
My real data is 10,000 rows and I want to filter out all rows that have more than one column of probabilities that are close to each other. Is there a way to do this without specifying pairs
from itertools import combinations
N = 2
df2 = df.set_index('ID')
keep = set()
seen = set()
for a,b in combinations(df2.index, r=2):
if {a,b}.issubset(seen):
continue
if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum()>=N:
keep.update({a, b})
seen.update({a, b})
print(keep)
# {'row1', 'row2', 'row4'}
CodePudding user response:
You can do that with:
Transpose the dataframe to get each sample as column and classes probabilities as rows.
We can check the minimal requirement if the difference between the 2 largest values is less than or equal 0.05.
df = pd.DataFrame(data=d).set_index("ID").T result = [col for col in df.columns if np.isclose(*df[col].nlargest(2), atol=0.05)]
Output:
['row2', 'row4', 'row5']'
Dataframe after the transpose:
ID row1 row2 row3 row4 row5 row6
class1 0.97 0.97 0.7 0.97 0.10 0.10
class2 0.20 0.96 0.5 0.98 0.20 0.11
class3 0.40 0.40 0.3 0.20 0.30 0.30
class4 0.30 0.30 0.4 0.30 0.75 0.90
class5 0.20 0.20 0.5 0.20 0.80 0.20