Given the following dataframe:
col_A | col_B |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
2 | 4 |
2 | 5 |
2 | 6 |
3 | 7 |
3 | 8 |
3 | 9 |
How would I assign the max value of col_B from a group in col_A to all rows in the next (in ascending order) group in col_A? Output desired is as follows:
col_A | col_B | max_col_B_ from_prev_col_A_group |
---|---|---|
1 | 1 | |
1 | 2 | |
1 | 3 | |
2 | 4 | 3 |
2 | 5 | 3 |
2 | 6 | 3 |
3 | 7 | 6 |
3 | 8 | 6 |
3 | 9 | 6 |
I figured I might be able to get the max by group, then shift the result forward and merge the values back into the source dataframe in some way. However I can't work out the "some way" bit of that solution :(
Any ideas on how to do this? I'm not wedded to my approach at all - just want the fastest solution as there are 1.5m rows and 80k discrete values in col_A.
Thanks in advance!
Code to reproduce dataframe:
import pandas as pd
df = pd.DataFrame({
    'col_A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
    'col_B': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
})
Answer:
Combine merge with groupby and shift:
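# Max of col_B per col_A group, shifted down one group, merged back on col_A.
# Both sides have a col_B, so merge suffixes them as col_B_x / col_B_y.
(df.merge(df.groupby("col_A").col_B.max().shift(1),
          on="col_A", how="left")
   .rename(columns={"col_B_x": "col_B",
                    "col_B_y": "max_col_B_ from_prev_col_A_group"}))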
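Since speed was the main concern, here is a possible alternative sketch that skips the merge and looks the values up with Series.map instead. It assumes df is a pandas DataFrame built from the data in the question; prev_max is just an illustrative variable name. It may be faster on large frames, but that is worth timing on the real 1.5m-row data rather than taking on faith.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "col_A": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col_B": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# groupby sorts the group keys ascending by default, so shift(1) moves each
# group's max down to the next group; the first group gets NaN.
prev_max = df.groupby("col_A")["col_B"].max().shift(1)

# prev_max is indexed by col_A, so map looks up each row's col_A value in it.
df["max_col_B_ from_prev_col_A_group"] = df["col_A"].map(prev_max)

print(df)
```

Because nothing is merged, col_B keeps its name and no rename step is needed. The new column is float because the first group has no previous group and is filled with NaN.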