It is possible to perform a
df.groupby.rolling.agg({'any_df_col': 'count'})
But how about a size agg?
'count' will produce a series with the 'running count' of rows that match the groupby condition within the window (1, 1, 1, 2, 3, ...), but I would like to know, for each of those rows, the total number of rows that match the groupby (so 1, 1, 3, 3, 3 in that case).
Usually in pandas I think this is achieved by using size instead of count.
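For a plain (non-rolling) groupby, the difference looks like this (a minimal sketch with made-up data):
import pandas as pd
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, None, 3]})
print(df.groupby('key').size())           # rows per group: a -> 2, b -> 1
print(df.groupby('key')['val'].count())   # non-NA values only: a -> 1, b -> 1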
This code may illustrate the rolling case:
import datetime as dt
import pandas as pd
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4),
    ],
    'value': [1, 2, 1, 10, 10, 10, 10],
    'type': [0, 0, 0, 0, 0, 0, 0],
})
df = df.set_index(pd.DatetimeIndex(df['time_ref']), drop=True)
by = ['value']
window = '1H'
gb_rolling = df.groupby(by=by).rolling(window=window)
agg_d = {'type': 'count'}
test = gb_rolling.agg(agg_d)
print(test)
# this works
                           type
value time_ref
1     2023-01-01 00:30:00   1.0
      2023-01-01 01:00:00   2.0
2     2023-01-01 00:30:00   1.0
10    2023-01-01 02:00:00   1.0
      2023-01-01 02:15:00   2.0
      2023-01-01 02:16:00   3.0
      2023-01-01 04:00:00   1.0
# but this doesn't
agg_d = {'type': 'size'}
test = gb_rolling.agg(agg_d)
# AttributeError: 'size' is not a valid function for 'RollingGroupby' object
My desired output is the SIZE of the group for every row, i.e. this:
                           type
value time_ref
1     2023-01-01 00:30:00     2
      2023-01-01 01:00:00     2
2     2023-01-01 00:30:00     1
10    2023-01-01 02:00:00     3
      2023-01-01 02:15:00     3
      2023-01-01 02:16:00     3
      2023-01-01 04:00:00     1
I cannot think of a way to do what I need without the rolling functionality, because the relevant windows of my data are not determined by calendar time but by the times of the events themselves. If that assumption is wrong, and I can get a 'size' without using rolling, that is OK too; but as far as I know I have to use rolling, since the event's time_ref relative to subsequent rows is what matters for grouping, not pure calendar time.
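For contrast, a pure calendar-time grouping (which is not what I want, since it bins on clock hours rather than anchoring windows on the events) would be something like:
# calendar-hour bins on the DatetimeIndex, not event-anchored windows
df.groupby(['value', pd.Grouper(freq='1H')]).size()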
Thanks.
CodePudding user response:
I'm not completely following your question. It seems like you want the type column to be the number of rows of a given value for each 1-hour increment... but if that's the case, your desired output is incorrect and should be:
                           type
value time_ref
1     2023-01-01 00:30:00     1   # <- not 2 here (1 in 0-hr, 1 in 1-hr window)
      2023-01-01 01:00:00     1   # <- same here
2     2023-01-01 00:30:00     1   # rest is ok...
...
If that's correct, then, starting with:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
...just add an hour column:
df['hour'] = df.time_ref.dt.hour
and aggregate on that and value:
tmp = (
    df.groupby(['value', 'hour'])
    .agg('count')
    .reset_index()
    .rename(columns={'time_ref': 'type'}))
which gives you:
   value  hour  type
0      1     0     1
1      1     1     1
2      2     0     1
3     10     2     3
4     10     4     1
...which you can join back onto your original df:
res = df.merge(tmp, how='left', on=['value', 'hour'])
             time_ref  value  hour  type
0 2023-01-01 00:30:00      1     0     1
1 2023-01-01 00:30:00      2     0     1
2 2023-01-01 01:00:00      1     1     1
3 2023-01-01 02:00:00     10     2     3
4 2023-01-01 02:15:00     10     2     3
5 2023-01-01 02:16:00     10     2     3
6 2023-01-01 04:00:00     10     4     1
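The merge can also be collapsed with transform, which broadcasts the per-group count straight back onto the rows; a sketch of the same idea:
# same result without building tmp and merging
df['type'] = df.groupby(['value', 'hour'])['time_ref'].transform('count')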
If that's not what you're looking for, please clarify your question.
CodePudding user response:
Ah... thanks for clarifying. I understand the problem now. I played around with rolling but couldn't find a way to get it to work either; here is an alternate method, though:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
df.index = df.time_ref
value_start = df.groupby('value').agg('min')  # first time_ref per value
df['hrs_since_group_start'] = df.apply(
    lambda row: row.time_ref - value_start.loc[row.value, 'time_ref'],
    axis=1
).view(int) / 1_000_000_000 / 60 // 60
(.view(int) changes the timedelta to nanoseconds, so the / 1_000_000_000 / 60 changes it to minutes since the group's first event, and // 60 changes it to the number of whole hours since the group's first event.)
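An equivalent spelling that skips both the apply and the nanosecond view, assuming time_ref is a datetime column, uses transform and dt.total_seconds():
# vectorized: whole hours since each group's first event
df['hrs_since_group_start'] = (
    (df.time_ref - df.groupby('value')['time_ref'].transform('min'))
    .dt.total_seconds() // 3600)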
group_hourly_counts = (
    df.groupby(['value', 'hrs_since_group_start'])
    .agg('count')
    .reset_index()
    .rename(columns={'time_ref': 'type'}))
res = (
    df.merge(
        group_hourly_counts,
        how='left',
        on=['value', 'hrs_since_group_start'])
    .drop(columns='hrs_since_group_start'))
res:
             time_ref  value  type
0 2023-01-01 00:30:00      1     2
1 2023-01-01 00:30:00      2     1
2 2023-01-01 01:00:00      1     2
3 2023-01-01 02:00:00     10     3
4 2023-01-01 02:15:00     10     3
5 2023-01-01 02:16:00     10     3
6 2023-01-01 04:00:00     10     1
...somebody more familiar with the rolling functionality can probably find you a simpler solution though :)