Here's my table:
category | number | probability |
---|---|---|
1102 | 24 | 0.3 |
1102 | 18 | 0.6 |
1102 | 16 | 0.1 |
2884 | 24 | 0.16 |
2884 | 15 | 0.8 |
2884 | 10 | 0.04 |
I want to replace every number whose probability is lower than 15% with the number that has the highest probability within its category group:
category | number | probability |
---|---|---|
1102 | 24 | 0.3 |
1102 | 18 | 0.6 |
1102 | 18 | 0.1 |
2884 | 24 | 0.16 |
2884 | 15 | 0.8 |
2884 | 15 | 0.04 |
CodePudding user response:
Find the number corresponding to the maximum probability in each group, then use loc to update the values:
n = df.sort_values('probability').groupby('category')['number'].transform('last')
df.loc[df['probability'] < 0.15, 'number'] = n
category number probability
0 1102 24 0.30
1 1102 18 0.60
2 1102 18 0.10
3 2884 24 0.16
4 2884 15 0.80
5 2884 15 0.04
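For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same sort-then-transform idea. The DataFrame is rebuilt by hand from the question's table, and the threshold is taken as strictly below 0.15, which is an assumption about what "lower than 15%" means.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "category": [1102, 1102, 1102, 2884, 2884, 2884],
    "number": [24, 18, 16, 24, 15, 10],
    "probability": [0.3, 0.6, 0.1, 0.16, 0.8, 0.04],
})

# After sorting by probability, the last row of each group holds the most
# probable number; transform("last") broadcasts it back to every row.
n = df.sort_values("probability").groupby("category")["number"].transform("last")

# Overwrite only the rows below the threshold; loc aligns n on the index,
# so the sorted order of n does not matter.
df.loc[df["probability"] < 0.15, "number"] = n

print(df)
```

Because the assignment aligns on index labels, this relies on the DataFrame having a unique index.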
CodePudding user response:
Use drop_duplicates to get the number with the highest probability in each category, then replace with np.where:
import numpy as np
highest_prob = df.sort_values('probability').drop_duplicates('category', keep='last').set_index('category')['number']
df['number'] = np.where(df['probability'] < 0.15, df['category'].map(highest_prob), df['number'])
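The lookup-and-map approach above can also be wrapped in a small helper. This is only a sketch: the function name `replace_low_probability` and its `threshold` parameter are hypothetical, and the sample data is copied from the question.

```python
import numpy as np
import pandas as pd


def replace_low_probability(df, threshold=0.15):
    """Return a copy where rows below the threshold take the number of
    their category's highest-probability row (hypothetical helper)."""
    # Keep one row per category: the one with the highest probability.
    highest_prob = (
        df.sort_values("probability")
        .drop_duplicates("category", keep="last")
        .set_index("category")["number"]
    )
    out = df.copy()
    # Map each row's category to that number, and use it only where the
    # row's own probability is below the threshold.
    out["number"] = np.where(
        out["probability"] < threshold,
        out["category"].map(highest_prob),
        out["number"],
    )
    return out


# Sample data reconstructed from the table in the question.
sample = pd.DataFrame({
    "category": [1102, 1102, 1102, 2884, 2884, 2884],
    "number": [24, 18, 16, 24, 15, 10],
    "probability": [0.3, 0.6, 0.1, 0.16, 0.8, 0.04],
})
result = replace_low_probability(sample)
print(result)
```

Returning a copy leaves the input untouched. If two rows in a category tie for the highest probability, which one is kept depends on the sort order, so ties may need explicit handling.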
CodePudding user response:
A possible solution using idxmax and numpy.where:
import numpy as np
# look up the number at each group's highest-probability row
idx = df.groupby("category")["probability"].transform("idxmax")
df["number"] = np.where(df["probability"].lt(0.15), df.loc[idx, "number"].to_numpy(), df["number"])
Output:
print(df)
category number probability
0 1102 24 0.30
1 1102 18 0.60
2 1102 18 0.10
3 2884 24 0.16
4 2884 15 0.80
5 2884 15 0.04