Home > database >  Find duplicates in dataframe and keep only the highest ones
Find duplicates in dataframe and keep only the highest ones

Time:11-11

I'm trying to find the duplicates in a dataframe that are higher so I can remove these later from another dataframe based on the index so the main dataframe is left with no duplicates and only the lowest value.

Basically let's say we have this dataframe (there could be multiple instances of the 2 groups):

index   group   value
  1       1      402
  2       1      396
  3       2      406
  4       2      416
  5       2      407
  6       1      200
  7       2      350

What I need is to only keep the duplicates that have the highest values and remove the lowest one. So the resulting dataframe would be:

index   group   value
  1       1      402
  4       2      416
  5       2      407

Speed is important too and there has to be no lookahead.

CodePudding user response:

Use groupby transform to mask the minimum values of each group. Then use the mask to select only the desired rows.

# map each consecutive group of rows to a different integer
group_labels = (df.group != df.group.shift()).cumsum()

# find the minimum value of each group 
group_min_val = df.groupby(group_labels)['value'].transform('min')

# get only the rows of each group whose value is higher than the minimum 
res = df[df.value != group_min_val]

>>> res

   index  group  value
0      1      1    402
3      4      2    416
4      5      2    407

Intermediate Results


>>> group_labels

0    1
1    1
2    2
3    2
4    2
5    3
6    4
Name: group, dtype: int64

>>> group_min_val

0    396
1    396
2    406
3    406
4    406
5    200
6    350
Name: value, dtype: int64

>>> df.value != group_min_val

0     True
1    False
2    False
3     True
4     True
5    False
6    False
Name: value, dtype: bool

CodePudding user response:

A one-liner version of @HarryPlotter's answer:

df.loc[df.value.ne(df.groupby(df.group.ne(df.group.shift()).cumsum()).value.transform('min'))]

using the trick from here to apply OP's understanding of "grouping", and transforming to get the minimum value in each group, and then .loc-ing for all values not equal to those.

WARNING: this drops any singleton "groups"! (Where OP's comments indicate "keep all but the lowest," this discards singleton values that would technically also be the 'highest' in their group.)

  • Related