I have a dataframe:
df_test = pd.DataFrame({'col': ['paris', 'paris', 'nantes', 'berlin', 'berlin', 'berlin', 'tokyo'],
'id_res': [12, 12, 14, 28, 8, 4, 89]})
col id_res
0 paris 12
1 paris 12
2 nantes 14
3 berlin 28
4 berlin 8
5 berlin 4
6 tokyo 89
I want to create a "check" column whose values are as follows:
- If a value in "col" has a duplicate and these duplicates have the same id_res, the value of "check" is False for duplicates
- If a value in "col" has duplicates and the "id_res" of these duplicates are different, assign True in "check" for the largest "id_res" value and False for the smallest
- If a value in "col" has no duplicates, the value of "check" is False.
The output I want is therefore:
col id_res check
0 paris 12 False
1 paris 12 False
2 nantes 14 False
3 berlin 28 True
4 berlin 8 False
5 berlin 4 False
6 tokyo 89 False
I tried with groupby but no satisfactory result. Can anyone help me plz
CodePudding user response:
Create 2 boolean masks then combine them and find the highest id_res
value per col
:
m1 = df['col'].duplicated(keep=False)
m2 = ~df['id_res'].duplicated(keep=False)
df['check'] = df.index.isin(df[m1 & m2].groupby('col')['id_res'].idxmax())
print(df)
# Output
col id_res check
0 paris 12 False
1 paris 12 False
2 nantes 14 False
3 berlin 28 True
4 berlin 8 False
5 berlin 4 False
6 tokyo 89 False
Details:
>>> pd.concat([df, m1.rename('m1'), m2.rename('m2')])
col id_res check m1 m2
0 paris 12 False True False
1 paris 12 False True False
2 nantes 14 False False True
3 berlin 28 True True True # <- group to check
4 berlin 8 False True True # <- because
5 berlin 4 False True True # <- m1 and m2 are True
6 tokyo 89 False False True
CodePudding user response:
You basically have 3 conditions, so use masks and take the logical intersection (AND/&
):
g = df_test.groupby('col')['id_res']
# is col duplicated?
m1 = df_test['col'].duplicated(keep=False)
# [ True True False True True True False]
# is id_res max of its group?
m2 = df_test['id_res'].eq(g.transform('max'))
# [ True True True True False False True]
# is group diverse? (more than 1 id_res)
m3 = g.transform('nunique').gt(1)
# [False False False True True True False]
# check if all conditions True
df_test['check'] = m1&m2&m3
Output:
col id_res check
0 paris 12 False
1 paris 12 False
2 nantes 14 False
3 berlin 28 True
4 berlin 8 False
5 berlin 4 False
6 tokyo 89 False