Resampling with origin='end_day'

I don't understand what origin='end_day' does.

The docs give the following example:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts 
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int32
>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int32

The docs explain origin='end_day' like this:

‘end_day’: origin is the ceiling midnight of the last day

So as far as I understand, the line

ts.resample('17min', origin='end_day').sum()

should be equivalent to

ts.resample('17min', origin=ts.index.max().ceil('1d')).sum()

However, passing the timestamp ts.index.max().ceil('1d') produces a different result:

>>> ts.resample('17min', origin=ts.index.max().ceil('1d')).sum() 
2000-10-01 23:21:00     3
2000-10-01 23:38:00    15
2000-10-01 23:55:00    27
2000-10-02 00:12:00    63

I'm looking for an explanation for this discrepancy and maybe a better general description of the 'end_day' argument than the docs provide.

edit: I'm using pandas 1.3.5

Answer:

The real equivalent of origin='end_day' is:

>>> ts.resample('17min', origin=ts.index.max().ceil('D'), 
                closed='right', label='right').sum()

2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64
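To check the equivalence yourself, here is a minimal sketch that rebuilds the series from the question and compares the two calls. The variable names (implicit, explicit, same) are only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the series from the question.
rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Shortcut form: pandas picks closed/label itself.
implicit = ts.resample("17min", origin="end_day").sum()

# Spelled-out form: explicit origin timestamp plus right-closed, right-labelled bins.
explicit = ts.resample(
    "17min",
    origin=ts.index.max().ceil("D"),
    closed="right",
    label="right",
).sum()

# Series.equals compares both the index and the values.
same = implicit.equals(explicit)
print(same)
print(implicit)
```

Dropping closed='right' and label='right' from the second call gives the left-closed, left-labelled result shown in the question.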

Update 1:

What if I use origin='end_day' but also explicitly pass values other than 'right' for closed and label? Where is the behavior defined for this?

From the source code of resample:

            # The backward resample sets ``closed`` to ``'right'`` by default
            # since the last value should be considered as the edge point for
            # the last bin. When origin in "end" or "end_day", the value for a
            # specific ``Timestamp`` index stands for the resample result from
            # the current ``Timestamp`` minus ``freq`` to the current
            # ``Timestamp`` with a right close.
            if origin in ["end", "end_day"]:
                if closed is None:
                    closed = "right"
                if label is None:
                    label = "right"
            else:
                if closed is None:
                    closed = "left"
                if label is None:
                    label = "left"
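In other words, only the arguments you leave as None are filled in, so an explicit closed or label always wins. A small standalone sketch of that rule follows. resolve_closed_and_label is a hypothetical helper written for this answer, not a pandas function, and it only mirrors the branch quoted above (the surrounding pandas code has further cases that are not shown here).

```python
def resolve_closed_and_label(origin, closed=None, label=None):
    """Hypothetical helper mirroring the quoted branch; not part of pandas."""
    # Backward-looking origins default to 'right', everything else to 'left'.
    default = "right" if origin in ("end", "end_day") else "left"
    if closed is None:
        closed = default
    if label is None:
        label = default
    return closed, label


# Nothing passed: both follow the origin.
defaults_end_day = resolve_closed_and_label("end_day")
# An explicit value is kept; only the missing one is filled in.
mixed = resolve_closed_and_label("end_day", closed="left")
# The usual forward-looking case.
defaults_start_day = resolve_closed_and_label("start_day")

print(defaults_end_day, mixed, defaults_start_day)
```

So origin='end_day' with closed='left' and no label gives left-closed bins that are still labelled on the right edge.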

Update 2a:

Consider df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7)). Now df.resample(rule='7d', origin='end_day') crashes with a ValueError.

If you don't set the closed parameter explicitly, resample sets it to 'right' because origin='end_day' (see above). The origin is then '2021-04-29' and the first bin edge is '2021-04-22', which is excluded from the bin. That leads to the "Values falls before first bin" error. Passing closed='left' avoids it:

df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7))
df.resample(rule='7d', origin='end_day', closed='left')  # <- HERE
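A runnable version of the comparison, as a sketch: the default call is wrapped in try/except because whether it raises may depend on the pandas version (the question uses 1.3.5), while the closed='left' call is the workaround. The names default_error, default_result, left_result and left_total are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    index=pd.date_range(start="2021-04-22 01:00:00", end="2021-04-28 01:00", freq="D"),
    data=range(7),
)

# Default closed (resolved to 'right' for origin='end_day').
# This may raise ValueError depending on the pandas version, so capture it.
try:
    default_result = df.resample(rule="7D", origin="end_day").sum()
    default_error = None
except ValueError as exc:
    default_result = None
    default_error = str(exc)

# Workaround: close the bins on the left explicitly.
left_result = df.resample(rule="7D", origin="end_day", closed="left").sum()
left_total = int(left_result[0].sum())

print("default call error:", default_error)
print(left_result)
```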

Update 2b:

If '2021-04-22' is the first bin edge, which timestamp falls outside of it? '2021-04-22 01:00:00' is later, right?

df = pd.DataFrame(index=pd.date_range(start='2021-04-21 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(8))
print(df)

# Output:
                     0
2021-04-21 01:00:00  0
2021-04-22 01:00:00  1
2021-04-23 01:00:00  2
2021-04-24 01:00:00  3
2021-04-25 01:00:00  4
2021-04-26 01:00:00  5
2021-04-27 01:00:00  6
2021-04-28 01:00:00  7

This sample, which starts one day earlier, should make it clearer:

# closed='right' (default)
>>> df.resample(rule='7d', origin='end_day').sum()
             0
2021-04-22   1  # ('2021-04-15', '2021-04-22']
2021-04-29  27  # ('2021-04-22', '2021-04-29']

# closed='left'
>>> df.resample(rule='7d', origin='end_day', closed='left').sum()
             0
2021-04-22   0  # ['2021-04-15', '2021-04-22')
2021-04-29  28  # ['2021-04-22', '2021-04-29')

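bin_edges

The internal bin_edges values for the sample above are listed here. Note that with closed='right' each edge sits one nanosecond before the following midnight, which is why '2021-04-22 01:00:00' still lands in the first bin: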

# closed='right' (default)
>>> bin_edges
[1618531199999999999 1619135999999999999 1619740799999999999]

# after conversion
DatetimeIndex(['2021-04-15 23:59:59.999999999',
               '2021-04-22 23:59:59.999999999',
               '2021-04-29 23:59:59.999999999'],
              dtype='datetime64[ns]', freq=None)


# closed='left'
>>> bin_edges
[1618444800000000000 1619049600000000000 1619654400000000000]

# after conversion
DatetimeIndex(['2021-04-15',
               '2021-04-22',
               '2021-04-29'],
              dtype='datetime64[ns]', freq=None)
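To see how these edges produce the sums above, here is a minimal sketch that assigns each timestamp of the 8-row sample to a bin by hand with numpy's searchsorted. It is only an illustration of the binning rule, not how pandas does it internally, and sum_per_bin is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Index and values of the 8-row sample.
idx = pd.date_range("2021-04-21 01:00:00", "2021-04-28 01:00:00", freq="D")
values = np.arange(8)

# Bin edges copied from the listings above.
edges_right = pd.DatetimeIndex([
    "2021-04-15 23:59:59.999999999",
    "2021-04-22 23:59:59.999999999",
    "2021-04-29 23:59:59.999999999",
])
edges_left = pd.DatetimeIndex(["2021-04-15", "2021-04-22", "2021-04-29"])


def sum_per_bin(edges, closed):
    """Hypothetical helper: sum the values falling into each pair of edges."""
    e = edges.values.astype("datetime64[ns]")
    t = idx.values.astype("datetime64[ns]")
    # Right-closed bins (lo, hi]: a timestamp equal to an edge belongs to the bin ending there.
    # Left-closed bins [lo, hi): a timestamp equal to an edge belongs to the bin starting there.
    side = "left" if closed == "right" else "right"
    bins = e.searchsorted(t, side=side) - 1
    sums = np.bincount(bins, weights=values, minlength=len(e) - 1)
    return sums.astype(int).tolist()


right_sums = sum_per_bin(edges_right, "right")
left_sums = sum_per_bin(edges_left, "left")
print(right_sums)  # [1, 27]
print(left_sums)   # [0, 28]
```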