TLDR: When downsampling a Series with a DatetimeIndex
, e.g. from hourly to daily values, how can I ensure the result only contains time periods that are fully present in the original?
Example
I'll explain with a simplified example.
Starting point: daily values
import pandas as pd
# Source data: 2 full days, AND SOME ADDITIONAL HOURS.
i = pd.date_range('2022-03-04 22:00', '2022-03-07 09:00', freq='H')
hourly = pd.Series(range(len(i)), i)
I want to resample to days, but keep only those that days are completely present in the source series.
What is working: calendar days
If a day is defined as a normal calendar day, i.e., midnight to midnight, we can do this in 2 steps:
# 1) Resample.
grouper = pd.Grouper(freq='D')
daily = hourly.groupby(grouper).sum() # or .resample('D').sum()
# 2022-03-04 1
# 2022-03-05 324
# 2022-03-06 900
# 2022-03-07 545
# Freq: D, dtype: int64
# 2) Discard incomplete days.
# (reject the days that start before the start of the first hour)
incomplete_left = daily.index < hourly.index[0]
# (reject the days that end after the end of the last hour)
incomplete_right = daily.index pd.offsets.Day(1) > hourly.index[-1] pd.offsets.Hour(1)
# Trim.
daily_trimmed = daily[~incomplete_left & ~incomplete_right] # Keeps 2022-03-05 and -06. Good.
# 2022-03-05 324
# 2022-03-06 900
# Freq: D, dtype: int64
Sofar, so good.
What is not working: custom starting point
But what if a day is defined as starting at 06:00 and ending at 06:00 the next calender day? I can do the resampling, but don't know how to check which timestamps to reject.
# 1) Resampling is doable:
import datetime
def gasday(ts: pd.Timestamp) -> pd.Timestamp:
day = ts.floor("D")
if ts.time() < datetime.time(hour=6):
day = day - pd.DateOffset(days=1) # get previous day
return day
daily2 = hourly.groupby(gasday).sum()
# 2022-03-04 28
# 2022-03-05 468
# 2022-03-06 1044
# 2022-03-07 230
# dtype: int64
# 2) ... but how to find the days that must be rejected??
Remarks
I'm using
DatetimeIndex
, instead ofPeriodIndex
, which is why I we have the somewhat complicated formula forincomplete_right
. The reason for usingDatetimeIndex
is that I'm generally dealing with timezones (not shown in this example). The timestamps in the datetimeindex are left-bound.In my use-case, I'm given the
grouper
function (gasday
in this case), without knowing, what the cutoff time is (06:00 in this case).
CodePudding user response:
If your data is guaranteed to be hourly, then you can just count the records:
daily = hourly.groupby(pd.Grouper(freq='D', offset='6H')).agg(['size','sum'])
Output:
size sum
2022-03-04 06:00:00 8 28
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
2022-03-07 06:00:00 4 230
Looking form the data, it's fairly easy to see which ones should be dropped
complete_daily = daily.query('size==24')
Output:
size sum
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
Update: You can also try:
daily = (hourly.reset_index().groupby(pd.Grouper(key='index', freq='D', offset='6H'))
.agg(start=('index','min'), end=('index','max'), total=(0,'sum'))
)
Output:
start end total
index
2022-03-04 06:00:00 2022-03-04 22:00:00 2022-03-05 05:00:00 28
2022-03-05 06:00:00 2022-03-05 06:00:00 2022-03-06 05:00:00 468
2022-03-06 06:00:00 2022-03-06 06:00:00 2022-03-07 05:00:00 1044
2022-03-07 06:00:00 2022-03-07 06:00:00 2022-03-07 09:00:00 230
You can then query for complete days, e.g.
daily['start'].dt.hour.eq(6) & daily['end'].dt.hour.eq(5)
CodePudding user response:
I've decided to use the following logic:
If the first timestamp in
hourly.index
belongs to the same group (= same day) as the timestamp immediately before it, then the group is not fully present in thehourly
series and the first datapoint indaily
must be removed. If it does not belong to the same group, it is really the start of a new group, the group is fully present inhourly
, and no change is needed todaily
.Likewise, if the end of the last timestamp in
hourly.index
belongs to the same day as the timestamp immediately before* it, then the group is also not fully present in thehourly
series, and the final datapoint indaily
must be removed.
eps = pd.Timedelta(seconds=1)
start = hourly.index[0]
if gasday(start) == gasday(start - eps):
daily2 = daily2.iloc[1:]
end = hourly.index[-1] pd.offsets.Hour(1)
if gasday(end - eps) == gasday(end):
daily2 = daily2.iloc[:-1]
This works, and keeps the two days (2022-03-05 and -06) as wanted. It includes (-04) if we start i
at 2022-03-04 06:00
or earlier. Likewise, it keeps (-06) only if we end i
at 2022-03-07 05:00
or later.
*) Why before? Well, the 05:00
timestamp denotes the left-closed interval [05:00-06:00)
. 06:00 is actually the start of the next hour. Therefor, if this 06:00 timestamp belongs to the same day, as the moment immediately before it (05:59:59), then we do not have the complete day.
Now the only issues I have left is the following: I'd like to abstract this all away, like so:
def resample_and_trim(source, grouper):
agg = source.groupby(grouper).sum()
eps = pd.Timedelta(seconds=1)
start = source.index[0]
if grouper(start) == grouper(start - eps):
agg = agg.iloc[1:]
end = source.index[-1] pd.offsets.Hour(1)
if grouper(end - eps) == grouper(end):
agg = agg.iloc[:-1]
return agg
And then be able to call this in both cases. The latter works:
daily2 = resample_and_trim(hourly, gasday)
# 2022-03-05 468
# 2022-03-06 1044
# dtype: int64
But the former does not:
daily = resample_and_trim(hourly, pd.Grouper(freq='H'))
# Error in `grouper(start)`
# TypeError: 'TimeGrouper' object is not callable
I'll doctor around a bit more; if I find the solution, I'll edit this answer.