I'm trying to find, within each run of duplicated groups in a dataframe, the rows with the higher values, so I can later remove them (by index) from another dataframe, leaving that main dataframe with no duplicates and only the lowest value per run.
Basically, let's say we have this dataframe (the two groups can appear in multiple consecutive runs):
index group value
1 1 402
2 1 396
3 2 406
4 2 416
5 2 407
6 1 200
7 2 350
What I need is to keep only the duplicates with the higher values and drop the lowest one from each run. So the resulting dataframe would be:
index group value
1 1 402
4 2 416
5 2 407
Speed is important too, and there should be no lookahead.
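For reference, the example dataframe above can be reconstructed like this (treating "index" as an ordinary column, as in the table):
import pandas as pd

# minimal reconstruction of the example data shown above
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5, 6, 7],
    'group': [1, 1, 2, 2, 2, 1, 2],
    'value': [402, 396, 406, 416, 407, 200, 350],
})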
CodePudding user response:
Use groupby and transform to mask the minimum value of each consecutive group, then use the mask to select only the desired rows.
# map each consecutive group of rows to a different integer
group_labels = (df.group != df.group.shift()).cumsum()
# find the minimum value of each group
group_min_val = df.groupby(group_labels)['value'].transform('min')
# get only the rows of each group whose value is higher than the minimum
res = df[df.value != group_min_val]
>>> res
index group value
0 1 1 402
3 4 2 416
4 5 2 407
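Since the goal is to remove these rows from another dataframe by index, a possible follow-up (a sketch; other_df here is a hypothetical stand-in for the OP's main dataframe, assumed to share df's pandas index) would be:
# hypothetical "main" dataframe that shares the same pandas index as df
other_df = df.copy()
# drop the higher-valued duplicates found above, leaving only the lowest value per run
other_df = other_df.drop(res.index)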
Intermediate Results
>>> group_labels
0 1
1 1
2 2
3 2
4 2
5 3
6 4
Name: group, dtype: int64
>>> group_min_val
0 396
1 396
2 406
3 406
4 406
5 200
6 350
Name: value, dtype: int64
>>> df.value != group_min_val
0 True
1 False
2 False
3 True
4 True
5 False
6 False
Name: value, dtype: bool
CodePudding user response:
A one-liner version of @HarryPlotter's answer:
df.loc[df.value.ne(df.groupby(df.group.ne(df.group.shift()).cumsum()).value.transform('min'))]
It uses the shift-and-cumsum trick to apply the OP's notion of "grouping" (consecutive runs), transforms to get the minimum value in each group, and then uses .loc to select all rows whose value is not equal to that minimum.
WARNING: this drops any singleton "groups"! (Since the OP's comments indicate "keep all but the lowest," this discards singleton rows whose only value is the lowest and, technically, also the highest in their group.)
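If singleton groups should be kept instead, one possible adjustment (a sketch building on the same shift/cumsum labels, not part of the original answers) is to drop the minimum only in runs that contain more than one row:
# sketch: only mask the minimum where a consecutive run has more than one row
labels = df.group.ne(df.group.shift()).cumsum()
grouped = df.groupby(labels)['value']
is_min = df.value.eq(grouped.transform('min'))
has_peers = grouped.transform('size').gt(1)
res = df[~(is_min & has_peers)]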