Home > Back-end >  separate groupby in python pandas
separate groupby in python pandas

Time:07-01

I need help with groupby in pandas:

I have following df:

A     B          C          D
04547 2022-07-04 2022-07-04 1000000
04547 2022-07-11 2022-07-11 1000000
04547 2022-08-08 2022-08-08 1000000
04547 2022-10-11 2022-10-11 0100000
04547 2022-10-18 2022-10-18 0100000
04547 2022-10-24 2022-10-24 1000000
04547 2022-11-01 2022-11-01 0100000
04547 2022-11-08 2022-11-08 0100000
04548 2022-10-11 2022-10-11 0100000
04548 2022-10-18 2022-10-18 0100000
04548 2022-10-24 2022-10-24 1000000
04548 2022-11-01 2022-11-01 0100000
04548 2022-11-08 2022-11-08 0100000

my needed output should be:

    A           B           C        D
04547  2022-07-04  2022-08-08  1000000
04547  2022-10-11  2022-10-18  0100000
04547  2022-10-24  2022-10-24  1000000
04547  2022-11-01  2022-11-08  0100000
04548  2022-10-11  2022-10-18  0100000
04548  2022-10-24  2022-10-24  1000000
04548  2022-11-01  2022-11-08  0100000

but with:

a = {'A':'first','B':'first','C':'last','D':'first'}
df = df.groupby(['A','D']).agg(a)

A     B          C          D
4547 2022-10-11 2022-11-08  0100000
4547 2022-07-04 2022-10-24  1000000
4548 2022-10-11 2022-11-08  0100000
4548 2022-10-24 2022-10-24  1000000

because I have to interrupt grouping when a new series in column D starts seperated for each column A

CodePudding user response:

If need groups by consecutive values of A,D columns use DataFrame.shifted values comapre for not equal by DataFrame.ne with DataFrame.any for Series and then add cumualtive sum by Series.cumsum:

g = df[['A','D']].ne(df[['A','D']].shift()).any(1).cumsum()
a = {'A':'first','B':'first','C':'last','D':'first'}
df = df.groupby(g).agg(a)
print (df)
       A           B           C        D
1  04547  2022-07-04  2022-08-08  1000000
2  04547  2022-10-11  2022-10-18  0100000
3  04547  2022-10-24  2022-10-24  1000000
4  04547  2022-11-01  2022-11-08  0100000
5  04548  2022-10-11  2022-10-18  0100000
6  04548  2022-10-24  2022-10-24  1000000
7  04548  2022-11-01  2022-11-08  0100000

CodePudding user response:

You can use df['D'].ne(df['D'].shift().cumsum() as grouper the column D to ensure grouping by consecutive values (shift gives you access to the previous row value):

a = {'A':'first','B':'first','C':'last','D':'first'}
out = (df.groupby(['A', df['D'].ne(df['D'].shift()).cumsum()],
                  as_index=False)
         .agg(a)
      )

output:

       A           B           C        D
0  04547  2022-07-04  2022-08-08  1000000
1  04547  2022-10-11  2022-10-18  0100000
2  04547  2022-10-24  2022-10-24  1000000
3  04547  2022-11-01  2022-11-08  0100000
4  04548  2022-10-11  2022-10-18  0100000
5  04548  2022-10-24  2022-10-24  1000000
6  04548  2022-11-01  2022-11-08  0100000
  • Related