Consider the following example
import datetime
import pandas as pd
df = pd.DataFrame({'var' : np.random.randint(0,10, size = 16993)},
index = pd.date_range('2021-01-01 00:00:00',
'2021-03-01 00:00:00',
freq ='5 min'))
df.head(10)
Out[272]:
var
2021-01-01 00:00:00 5
2021-01-01 00:05:00 9
2021-01-01 00:10:00 1
2021-01-01 00:15:00 5
2021-01-01 00:20:00 8
This is a classic dataframe with intraday data. I need to aggregate it at the daily level, but I do not want to use the 24 hours from 00:00:00
to 23:59:59
that would be implicitly used if I ran df.groupby(df.index.date)
df.groupby(df.index.date).agg('sum').head()
Out[274]:
var
2021-01-01 1440
2021-01-02 1238
2021-01-03 1274
2021-01-04 1277
2021-01-05 1253
Rather, I want to aggregate using a 24H windows that starts at 1:00:00
every day. So the first group will be between 2021-01-01 01:00:00
and 2021-02-01 00:59:59
, the second between 2021-02-01 01:00:00
and 2021-03-01 00:59:59
, and so on. The starting times can easily be identified, as shown below.
df['flag'] = df.index.time == datetime.time(1,0)
What is the best way to proceed to create the groups? Thanks!
CodePudding user response:
Resample
with offset
The offset parameter is the Timedelta
that can be added to the origin to create custom grouping intervals
df.resample('D', offset=pd.Timedelta(hours=1)).sum()
var
2020-12-31 01:00:00 72
2021-01-01 01:00:00 1278
2021-01-02 01:00:00 1301
2021-01-03 01:00:00 1291
2021-01-04 01:00:00 1346
....
CodePudding user response:
You can use groupby.sum
if you subtract 1H
from the grouper:
df.groupby((df.index - pd.Timedelta('1H')).date).sum()
# var
# 2020-12-31 58
# 2021-01-01 1319
# 2021-01-02 1261
# ... ...
# 2021-02-26 1286
# 2021-02-27 1246
# 2021-02-28 1240