Remove consecutive duplicates while keeping the max value

Time:12-22

I am trying to remove consecutive duplicates from column X while keeping the entry with the max value in column Y, so far without success. The data frame is as follows:

idx X Y
0 A 3
1 B 2
2 A 7
3 A 10
4 B 1
5 C 4
6 A 3
7 A 5

What I want to achieve is:

idx X Y
0 A 3
1 B 2
2 A 10
3 B 1
4 C 4
5 A 5

Most of the solutions I found simply drop every duplicate, without treating each run of consecutive repeats separately.

CodePudding user response:

You need to apply an itertools-style-groupby and then keep the rows where Y is maximal.

>>> df 
   idx  X   Y
0    0  A   3
1    1  B   2
2    2  A   7
3    3  A  10
4    4  B   1
5    5  C   4
6    6  A   3
7    7  A   5
>>> y_max = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].transform('max')
>>> df[df['Y'] == y_max] 
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

edit:

Initial solution had a bug and only produced the correct idx column by accident.
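
For readers unsure what "itertools-style groupby" means here, the following is a minimal pure-Python sketch of the same idea, with the question's data typed in as tuples (the names `rows` and `result` are only illustrative). `itertools.groupby` starts a new group whenever the key changes, which is the behaviour the `ne(shift()).cumsum()` key reproduces in pandas.

```python
from itertools import groupby
from operator import itemgetter

# (idx, X, Y) rows from the question
rows = [
    (0, "A", 3), (1, "B", 2), (2, "A", 7), (3, "A", 10),
    (4, "B", 1), (5, "C", 4), (6, "A", 3), (7, "A", 5),
]

# groupby only merges *consecutive* rows with an equal key (X),
# so the two separate runs of "A" stay separate groups.
# Within each run, keep the row whose Y is largest.
result = [max(run, key=itemgetter(2)) for _, run in groupby(rows, key=itemgetter(1))]

for row in result:
    print(row)
```

Unlike the boolean mask above, `max` returns a single row per run, so a tie on Y keeps only the first tied row.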

CodePudding user response:

Alternatively, pass the run key straight to groupby and aggregate with max:

df.groupby(df['X'].ne(df['X'].shift()).cumsum(), as_index=False).max()

Or:

df.groupby(df['X'].ne(df['X'].shift()).cumsum()).max().reset_index(drop=True)

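Both output the following. Note that max() is computed per column, so idx and Y come from the same row here only because the largest Y happens to sit in the last row of each run: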

   idx  X   Y
0    0  A   3
1    1  B   2
2    3  A  10
3    4  B   1
4    5  C   4
5    7  A   5
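
If whole rows must be preserved regardless of where the maximum sits in a run, one possible variant is to select rows by `idxmax` instead of aggregating each column. This is a sketch under the assumption that the frame has the three columns shown in the question; `run_id` and `out` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "idx": range(8),
    "X": list("ABAABCAA"),
    "Y": [3, 2, 7, 10, 1, 4, 3, 5],
})

# Give each run of consecutive equal X values its own id:
# a new run starts wherever X differs from the previous row.
run_id = df["X"].ne(df["X"].shift()).cumsum()

# idxmax yields the index label of the largest Y in each run,
# so .loc returns complete rows rather than per-column maxima.
out = df.loc[df.groupby(run_id)["Y"].idxmax()].reset_index(drop=True)
print(out)
```

When a run has several rows tied for the largest Y, `idxmax` keeps the first of them, whereas the `transform('max')` mask from the first answer keeps all tied rows.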