I need help with groupby in pandas:
I have following df:
A B C D
04547 2022-07-04 2022-07-04 1000000
04547 2022-07-11 2022-07-11 1000000
04547 2022-08-08 2022-08-08 1000000
04547 2022-10-11 2022-10-11 0100000
04547 2022-10-18 2022-10-18 0100000
04547 2022-10-24 2022-10-24 1000000
04547 2022-11-01 2022-11-01 0100000
04547 2022-11-08 2022-11-08 0100000
04548 2022-10-11 2022-10-11 0100000
04548 2022-10-18 2022-10-18 0100000
04548 2022-10-24 2022-10-24 1000000
04548 2022-11-01 2022-11-01 0100000
04548 2022-11-08 2022-11-08 0100000
my needed output should be:
A B C D
04547 2022-07-04 2022-08-08 1000000
04547 2022-10-11 2022-10-18 0100000
04547 2022-10-24 2022-10-24 1000000
04547 2022-11-01 2022-11-08 0100000
04548 2022-10-11 2022-10-18 0100000
04548 2022-10-24 2022-10-24 1000000
04548 2022-11-01 2022-11-08 0100000
but with:
a = {'A':'first','B':'first','C':'last','D':'first'}
df = df.groupby(['A','D']).agg(a)
A B C D
4547 2022-10-11 2022-11-08 0100000
4547 2022-07-04 2022-10-24 1000000
4548 2022-10-11 2022-11-08 0100000
4548 2022-10-24 2022-10-24 1000000
because I have to interrupt grouping when a new series in column D starts seperated for each column A
CodePudding user response:
If need groups by consecutive values of A,D
columns use DataFrame.shift
ed values comapre for not equal by DataFrame.ne
with DataFrame.any
for Series
and then add cumualtive sum by Series.cumsum
:
g = df[['A','D']].ne(df[['A','D']].shift()).any(1).cumsum()
a = {'A':'first','B':'first','C':'last','D':'first'}
df = df.groupby(g).agg(a)
print (df)
A B C D
1 04547 2022-07-04 2022-08-08 1000000
2 04547 2022-10-11 2022-10-18 0100000
3 04547 2022-10-24 2022-10-24 1000000
4 04547 2022-11-01 2022-11-08 0100000
5 04548 2022-10-11 2022-10-18 0100000
6 04548 2022-10-24 2022-10-24 1000000
7 04548 2022-11-01 2022-11-08 0100000
CodePudding user response:
You can use df['D'].ne(df['D'].shift().cumsum()
as grouper the column D to ensure grouping by consecutive values (shift
gives you access to the previous row value):
a = {'A':'first','B':'first','C':'last','D':'first'}
out = (df.groupby(['A', df['D'].ne(df['D'].shift()).cumsum()],
as_index=False)
.agg(a)
)
output:
A B C D
0 04547 2022-07-04 2022-08-08 1000000
1 04547 2022-10-11 2022-10-18 0100000
2 04547 2022-10-24 2022-10-24 1000000
3 04547 2022-11-01 2022-11-08 0100000
4 04548 2022-10-11 2022-10-18 0100000
5 04548 2022-10-24 2022-10-24 1000000
6 04548 2022-11-01 2022-11-08 0100000