I am trying to remove consecutive duplicates from column X, keeping from each run of duplicates the row with the largest value in column Y, but so far without success. The data frame is as follows:
idx | X | Y |
---|---|---|
0 | A | 3 |
1 | B | 2 |
2 | A | 7 |
3 | A | 10 |
4 | B | 1 |
5 | C | 4 |
6 | A | 3 |
7 | A | 5 |
What I want to achieve is:
idx | X | Y |
---|---|---|
0 | A | 3 |
1 | B | 2 |
2 | A | 10 |
3 | B | 1 |
4 | C | 4 |
5 | A | 5 |
Most of the solutions I found simply drop every duplicate, without accounting for the fact that the same value of X can reappear in a later, separate run.
CodePudding user response:
You need an itertools-style groupby, i.e. one that only groups consecutive equal values of X, and then keep the rows where Y is maximal within each group.
>>> df
   idx  X   Y
0    0  A   3
1    1  B   2
2    2  A   7
3    3  A  10
4    4  B   1
5    5  C   4
6    6  A   3
7    7  A   5
>>> y_max = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].transform('max')
>>> df[df['Y'] == y_max]
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5
Edit:
The initial solution had a bug and only produced the correct idx column by accident.
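For readers unfamiliar with the term, "itertools-style" refers to how `itertools.groupby` starts a new group every time the key changes, instead of pooling all equal keys together. The following is a minimal pure-Python sketch of that behaviour on the question's data; the `rows` list and the variable names are illustrative only and are not part of the answer above.

```python
from itertools import groupby

# (X, Y) pairs from the question, in their original order
rows = [("A", 3), ("B", 2), ("A", 7), ("A", 10),
        ("B", 1), ("C", 4), ("A", 3), ("A", 5)]

# groupby opens a new group whenever the key (X) changes,
# so the separate runs of "A" are never merged together
result = [
    max(run, key=lambda row: row[1])  # row with the largest Y in this run
    for _, run in groupby(rows, key=lambda row: row[0])
]

print(result)
```

The pandas expression `df['X'].ne(df['X'].shift()).cumsum()` plays the same role: it assigns one label per consecutive run, which `groupby` can then use as its key.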
CodePudding user response:
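Alternatively, I'd simply pass the run labels to groupby and take the maximum of each group. Note that max() is computed per column, so the idx and Y values in an output row need not come from the same input row (here they happen to):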
df.groupby(df['X'].ne(df['X'].shift()).cumsum(), as_index=False).max()
Or:
df.groupby(df['X'].ne(df['X'].shift()).cumsum()).max().reset_index(drop=True)
Both output:
   idx  X   Y
0    0  A   3
1    1  B   2
2    3  A  10
3    4  B   1
4    5  C   4
5    7  A   5
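If whole rows need to be carried over intact, one possible variant is to look up the index label of the largest Y in each run with `idxmax` and select those rows. This is only a sketch under the assumption that the frame looks like the one in the question; the names `starts`, `run_id` and `best_rows` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["A", "B", "A", "A", "B", "C", "A", "A"],
    "Y": [3, 2, 7, 10, 1, 4, 3, 5],
})

# True wherever X differs from the previous row, i.e. a new run starts
starts = df["X"].ne(df["X"].shift())

# Running count of run starts gives one label per consecutive run
run_id = starts.cumsum()

# Index label of the first row holding the largest Y in each run
best_rows = df.groupby(run_id)["Y"].idxmax()

# Select those rows whole, then renumber the index from 0
result = df.loc[best_rows].reset_index(drop=True)
print(result)
```

When a run contains ties for the largest Y, `idxmax` keeps only the first such row, whereas a comparison like `df['Y'] == y_max` keeps all of them.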