Pandas unusual groupby

Time:06-28

I need an unusual groupby on a huge DataFrame; I have simplified it here for clarity. Is it possible to group by year so that the last available December date of each year belongs to its own year and is also included in the following year's group?

import pandas as pd

data = {'date': ['2017-03-31', '2017-04-03', '2017-12-27', '2017-12-28',
                 '2017-12-29', '2018-01-02', '2018-12-31', '2019-01-02',
                 '2019-01-03', '2019-12-31', '2020-12-30', '2020-12-31',
                 '2021-01-20', '2021-12-30', '2021-12-31', '2022-05-30',
                 '2022-05-31']
        }
df = pd.DataFrame(data)


Desired groupbys:
date          g1    g2    g3    g4    g5    g6
2017-03-31  2017
2017-04-03  2017
2017-12-27  2017
2017-12-28  2017
2017-12-29  2017  2018
2018-01-02        2018
2018-12-31        2018  2019
2019-01-02              2019
2019-01-03              2019
2019-12-31              2019  2020
2020-12-30                    2020
2020-12-31                    2020  2021
2021-01-20                          2021
2021-12-30                          2021
2021-12-31                          2021  2022
2022-05-30                                2022
2022-05-31                                2022

Answer:

IIUC, you can duplicate the last date of each year when it falls in December, assign the copy to the following year, and then pivot:

d = pd.to_datetime(df['date'])

out = (pd
 .concat([df.assign(year=d.dt.year),
          # last row of each year, kept only if it is a December date,
          # duplicated and assigned to the following year
          df[df.groupby(d.dt.year, as_index=False).cumcount(ascending=False).eq(0)
            & d.dt.month.eq(12)
            ].assign(year=d.dt.year + 1)])
 # label the years g1, g2, ... in ascending order
 .assign(col=lambda x: 'g' + x.groupby('year').ngroup().add(1).astype(str))
 .pivot_table(index='date', columns='col', values='year')
 .convert_dtypes()
)

output:

col           g1    g2    g3    g4    g5    g6
date
2017-03-31  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-04-03  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-27  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-28  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-29  2017  2018  <NA>  <NA>  <NA>  <NA>
2018-01-02  <NA>  2018  <NA>  <NA>  <NA>  <NA>
2018-12-31  <NA>  2018  2019  <NA>  <NA>  <NA>
2019-01-02  <NA>  <NA>  2019  <NA>  <NA>  <NA>
2019-01-03  <NA>  <NA>  2019  <NA>  <NA>  <NA>
2019-12-31  <NA>  <NA>  2019  2020  <NA>  <NA>
2020-12-30  <NA>  <NA>  <NA>  2020  <NA>  <NA>
2020-12-31  <NA>  <NA>  <NA>  2020  2021  <NA>
2021-01-20  <NA>  <NA>  <NA>  <NA>  2021  <NA>
2021-12-30  <NA>  <NA>  <NA>  <NA>  2021  <NA>
2021-12-31  <NA>  <NA>  <NA>  <NA>  2021  2022
2022-05-30  <NA>  <NA>  <NA>  <NA>  <NA>  2022
2022-05-31  <NA>  <NA>  <NA>  <NA>  <NA>  2022
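
To see which rows get carried over into the next year, the duplication step can be run on its own. This is only a sketch on a shortened copy of the sample dates; the names is_last_of_year, is_december and carried are made up for illustration, and the rows are assumed to be sorted by date already.

```python
import pandas as pd

# shortened sample (assumption: rows are already sorted by date)
df = pd.DataFrame({'date': ['2017-03-31', '2017-12-29', '2018-01-02',
                            '2018-12-31', '2019-01-02', '2022-05-31']})
d = pd.to_datetime(df['date'])

# True on the last row of each calendar year
is_last_of_year = df.groupby(d.dt.year).cumcount(ascending=False).eq(0)
# True on rows whose date falls in December
is_december = d.dt.month.eq(12)

# rows that must also appear in the following year's group
carried = df[is_last_of_year & is_december].assign(year=d.dt.year + 1)
print(carried)
```

Here 2019-01-02 and 2022-05-31 are the last rows of their years but are not December dates, so they are not carried over.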

With groupby only:

d = pd.to_datetime(df['date'])

# last row of each year, kept only when it falls in December
last_dec = (df.groupby(d.dt.year).cumcount(ascending=False).eq(0)
            & d.dt.month.eq(12))

out = (pd
 .concat([df.assign(year=d.dt.year),
          df[last_dec].assign(year=d.dt.year + 1)])
 .groupby('year')
 # perform your aggregation here
)
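
As an example of what the aggregation step might look like, here is a minimal sketch with a made-up numeric column. The value column and the names long, start, end, n and total are assumptions for illustration and are not part of the question's data.

```python
import pandas as pd

# 'value' is a hypothetical column added only to have something to aggregate
df = pd.DataFrame({'date': ['2017-12-28', '2017-12-29', '2018-01-02',
                            '2018-12-31', '2019-01-02'],
                   'value': [1, 2, 3, 4, 5]})
d = pd.to_datetime(df['date'])

# last row of each year, kept only when it falls in December
last_dec = (df.groupby(d.dt.year).cumcount(ascending=False).eq(0)
            & d.dt.month.eq(12))

# every row under its own year, plus the carried-over December rows
long = pd.concat([df.assign(year=d.dt.year),
                  df[last_dec].assign(year=d.dt.year + 1)])

out = long.groupby('year').agg(start=('date', 'min'),   # ISO date strings sort correctly
                               end=('date', 'max'),
                               n=('date', 'count'),
                               total=('value', 'sum'))
print(out)
```

Each year after the first then starts at the previous year's last December date, so that row contributes to two groups.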