I need an unusual groupby on a huge DataFrame; I have simplified it here for clarity. Is it possible to group by year so that the last available December date of each year belongs both to its own year's group and to the following year's group?
import pandas as pd

data = {'date': ['2017-03-31', '2017-04-03', '2017-12-27', '2017-12-28',
                 '2017-12-29', '2018-01-02', '2018-12-31', '2019-01-02',
                 '2019-01-03', '2019-12-31', '2020-12-30', '2020-12-31',
                 '2021-01-20', '2021-12-30', '2021-12-31', '2022-05-30',
                 '2022-05-31']
        }
df = pd.DataFrame(data)
Desired groupbys:
date        g1    g2    g3    g4    g5    g6
2017-03-31  2017
2017-04-03  2017
2017-12-27  2017
2017-12-28  2017
2017-12-29  2017  2018
2018-01-02        2018
2018-12-31        2018  2019
2019-01-02              2019
2019-01-03              2019
2019-12-31              2019  2020
2020-12-30                    2020
2020-12-31                    2020  2021
2021-01-20                          2021
2021-12-30                          2021
2021-12-31                          2021  2022
2022-05-30                                2022
2022-05-31                                2022
CodePudding user response:
IIUC, you can use pivot_table after duplicating the last December date of each year into the following year:
d = pd.to_datetime(df['date'])
out = (pd
   .concat([df.assign(year=d.dt.year),
            # duplicate the last row of each year, if it falls in December,
            # and assign the copy to the following year
            df[df.groupby(d.dt.year, as_index=False).cumcount(ascending=False).eq(0)
               & d.dt.month.eq(12)
              ].assign(year=d.dt.year + 1)])
   # label each year group g1, g2, ...
   .assign(col=lambda d: 'g' + d.groupby('year').ngroup().add(1).astype(str))
   .pivot_table(index='date', columns='col', values='year')
   .convert_dtypes()
)
output:
col           g1    g2    g3    g4    g5    g6
date
2017-03-31  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-04-03  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-27  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-28  2017  <NA>  <NA>  <NA>  <NA>  <NA>
2017-12-29  2017  2018  <NA>  <NA>  <NA>  <NA>
2018-01-02  <NA>  2018  <NA>  <NA>  <NA>  <NA>
2018-12-31  <NA>  2018  2019  <NA>  <NA>  <NA>
2019-01-02  <NA>  <NA>  2019  <NA>  <NA>  <NA>
2019-01-03  <NA>  <NA>  2019  <NA>  <NA>  <NA>
2019-12-31  <NA>  <NA>  2019  2020  <NA>  <NA>
2020-12-30  <NA>  <NA>  <NA>  2020  <NA>  <NA>
2020-12-31  <NA>  <NA>  <NA>  2020  2021  <NA>
2021-01-20  <NA>  <NA>  <NA>  <NA>  2021  <NA>
2021-12-30  <NA>  <NA>  <NA>  <NA>  2021  <NA>
2021-12-31  <NA>  <NA>  <NA>  <NA>  2021  2022
2022-05-30  <NA>  <NA>  <NA>  <NA>  <NA>  2022
2022-05-31  <NA>  <NA>  <NA>  <NA>  <NA>  2022
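To see why only December dates get carried forward, the row-selection step can be looked at on its own. This is a minimal sketch on a shortened, made-up date list; the names `from_end`, `is_carry` and `carried` are introduced here for illustration and are not part of the answer's code.

```python
import pandas as pd

# Shortened illustrative data: 2019 ends in May, so it should not be carried over.
df = pd.DataFrame({'date': ['2017-12-28', '2017-12-29', '2018-01-02',
                            '2018-12-31', '2019-01-02', '2019-05-31']})
d = pd.to_datetime(df['date'])

# Position of each row counted from the end of its calendar year (0 = last row).
from_end = df.groupby(d.dt.year).cumcount(ascending=False)

# Keep only rows that are last in their year AND fall in December.
is_carry = from_end.eq(0) & d.dt.month.eq(12)

# These rows get a second copy labelled with the following year.
carried = df[is_carry].assign(year=d.dt.year + 1)
print(carried)
```

Here `carried` holds 2017-12-29 (labelled 2018) and 2018-12-31 (labelled 2019); 2019-05-31 is the last row of 2019 but is dropped by the month check.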
If you only need the groupby:
d = pd.to_datetime(df['date'])
out = (pd
   .concat([df.assign(year=d.dt.year),
            # same December check as above, so a partial final year is not carried over
            df[df.groupby(d.dt.year, as_index=False).cumcount(ascending=False).eq(0)
               & d.dt.month.eq(12)]
            .assign(year=d.dt.year + 1)])
   .groupby('year')
   # perform your aggregation here
)
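As one possible way to fill in the aggregation step, the sketch below counts the rows in each year group and reports the first and last date, using the question's data. The choice of `min`/`max`/`count` is only an example, and the names `is_carry`, `both` and `summary` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2017-03-31', '2017-04-03', '2017-12-27', '2017-12-28',
                            '2017-12-29', '2018-01-02', '2018-12-31', '2019-01-02',
                            '2019-01-03', '2019-12-31', '2020-12-30', '2020-12-31',
                            '2021-01-20', '2021-12-30', '2021-12-31', '2022-05-30',
                            '2022-05-31']})
d = pd.to_datetime(df['date'])

# Last row of each year, kept only when it falls in December.
is_carry = df.groupby(d.dt.year).cumcount(ascending=False).eq(0) & d.dt.month.eq(12)

# Original rows plus a copy of each carried row labelled with the next year.
both = pd.concat([df.assign(year=d.dt.year),
                  df[is_carry].assign(year=d.dt.year + 1)])

# Example aggregation; ISO-formatted date strings sort chronologically.
summary = both.groupby('year')['date'].agg(['min', 'max', 'count'])
print(summary)
```

Each group from 2018 onward starts at the previous year's last December date, for example 2018 runs from 2017-12-29 to 2018-12-31 with three rows.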