TLDR: When downsampling a Series with a DatetimeIndex, e.g. from hourly to daily values, how can I ensure the result only contains time periods that are fully present in the original?

Example

I'll explain with a simplified example.

Starting point: daily values

import pandas as pd

# Source data: 2 full days, AND SOME ADDITIONAL HOURS.
i = pd.date_range('2022-03-04 22:00', '2022-03-07 09:00', freq='H')
hourly = pd.Series(range(len(i)), i)

I want to resample to days, but keep only those that days are completely present in the source series.

What is working: calendar days

If a day is defined as a normal calendar day, i.e., midnight to midnight, we can do this in 2 steps:

# 1) Resample.
grouper = pd.Grouper(freq='D')
daily = hourly.groupby(grouper).sum() # or .resample('D').sum()
# 2022-03-04      1
# 2022-03-05    324
# 2022-03-06    900
# 2022-03-07    545
# Freq: D, dtype: int64

# 2) Discard incomplete days.
# (reject the days that start before the start of the first hour)
incomplete_left = daily.index < hourly.index[0]
# (reject the days that end after the end of the last hour)
incomplete_right = daily.index   pd.offsets.Day(1) > hourly.index[-1]   pd.offsets.Hour(1)
# Trim.
daily_trimmed = daily[~incomplete_left & ~incomplete_right] # Keeps 2022-03-05 and -06. Good.
# 2022-03-05    324
# 2022-03-06    900
# Freq: D, dtype: int64

Sofar, so good.

What is not working: custom starting point

But what if a day is defined as starting at 06:00 and ending at 06:00 the next calender day? I can do the resampling, but don't know how to check which timestamps to reject.

# 1) Resampling is doable:
import datetime
def gasday(ts: pd.Timestamp) -> pd.Timestamp:
    day = ts.floor("D")
    if ts.time() < datetime.time(hour=6):  
        day = day - pd.DateOffset(days=1) # get previous day
    return day

daily2 = hourly.groupby(gasday).sum()
# 2022-03-04      28
# 2022-03-05     468
# 2022-03-06    1044
# 2022-03-07     230
# dtype: int64

# 2) ... but how to find the days that must be rejected??

Remarks

I'm using DatetimeIndex, instead of PeriodIndex, which is why I we have the somewhat complicated formula for incomplete_right. The reason for using DatetimeIndex is that I'm generally dealing with timezones (not shown in this example). The timestamps in the datetimeindex are left-bound.
In my use-case, I'm given the grouper function (gasday in this case), without knowing, what the cutoff time is (06:00 in this case).

CodePudding user response：

If your data is guaranteed to be hourly, then you can just count the records:

daily = hourly.groupby(pd.Grouper(freq='D', offset='6H')).agg(['size','sum'])

Output:

                     size   sum
2022-03-04 06:00:00     8    28
2022-03-05 06:00:00    24   468
2022-03-06 06:00:00    24  1044
2022-03-07 06:00:00     4   230

Looking form the data, it's fairly easy to see which ones should be dropped

complete_daily = daily.query('size==24')

Output:

                     size   sum
2022-03-05 06:00:00    24   468
2022-03-06 06:00:00    24  1044

Update: You can also try:

daily = (hourly.reset_index().groupby(pd.Grouper(key='index', freq='D', offset='6H'))
               .agg(start=('index','min'), end=('index','max'), total=(0,'sum'))
        )

Output:

                                  start                 end  total
index                                                             
2022-03-04 06:00:00 2022-03-04 22:00:00 2022-03-05 05:00:00     28
2022-03-05 06:00:00 2022-03-05 06:00:00 2022-03-06 05:00:00    468
2022-03-06 06:00:00 2022-03-06 06:00:00 2022-03-07 05:00:00   1044
2022-03-07 06:00:00 2022-03-07 06:00:00 2022-03-07 09:00:00    230

You can then query for complete days, e.g.

daily['start'].dt.hour.eq(6) & daily['end'].dt.hour.eq(5)

CodePudding user response：

I've decided to use the following logic:

If the first timestamp in hourly.index belongs to the same group (= same day) as the timestamp immediately before it, then the group is not fully present in the hourly series and the first datapoint in daily must be removed. If it does not belong to the same group, it is really the start of a new group, the group is fully present in hourly, and no change is needed to daily.
Likewise, if the end of the last timestamp in hourly.index belongs to the same day as the timestamp immediately before* it, then the group is also not fully present in the hourly series, and the final datapoint in daily must be removed.

eps = pd.Timedelta(seconds=1)
start = hourly.index[0]
if gasday(start) == gasday(start - eps):
    daily2 = daily2.iloc[1:]
end = hourly.index[-1]   pd.offsets.Hour(1)
if gasday(end - eps) == gasday(end):
    daily2 = daily2.iloc[:-1]

This works, and keeps the two days (2022-03-05 and -06) as wanted. It includes (-04) if we start i at 2022-03-04 06:00 or earlier. Likewise, it keeps (-06) only if we end i at 2022-03-07 05:00 or later.

*) Why before? Well, the 05:00 timestamp denotes the left-closed interval [05:00-06:00). 06:00 is actually the start of the next hour. Therefor, if this 06:00 timestamp belongs to the same day, as the moment immediately before it (05:59:59), then we do not have the complete day.

Now the only issues I have left is the following: I'd like to abstract this all away, like so:

def resample_and_trim(source, grouper):
    agg = source.groupby(grouper).sum()

    eps = pd.Timedelta(seconds=1)
    start = source.index[0]
    if grouper(start) == grouper(start - eps):
        agg = agg.iloc[1:]
    end = source.index[-1]   pd.offsets.Hour(1)
    if grouper(end - eps) == grouper(end):
        agg = agg.iloc[:-1]
    return agg

And then be able to call this in both cases. The latter works:

daily2 = resample_and_trim(hourly, gasday)
# 2022-03-05     468
# 2022-03-06    1044
# dtype: int64

But the former does not:

daily = resample_and_trim(hourly, pd.Grouper(freq='H'))
# Error in `grouper(start)`
# TypeError: 'TimeGrouper' object is not callable

I'll doctor around a bit more; if I find the solution, I'll edit this answer.