It is possible to perform a
df.groupby.rolling.agg({'any_df_col': 'count'})
But how about a size agg?
'count' will produce a series with the 'running count' of rows that match the groupby condition within the window (1, 1, 1, 2, 3, ...), but I would like to know, for each of those rows, the total number of rows that match the groupby (so 1, 1, 3, 3, 3 in that case).
Usually in pandas I think this is achieved by using size instead of count.
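For a plain (non-rolling) groupby, the difference looks like this (a minimal sketch with made-up data):
import pandas as pd
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, None, 3]})
print(df.groupby('key').size())           # rows per group: a -> 2, b -> 1
print(df.groupby('key')['val'].count())   # non-NA values only: a -> 1, b -> 1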
This code may illustrate the rolling case:
import datetime as dt
import pandas as pd
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4),
    ],
    'value': [1, 2, 1, 10, 10, 10, 10],
    'type': [0, 0, 0, 0, 0, 0, 0],
})
df = df.set_index(pd.DatetimeIndex(df['time_ref']), drop=True)
by = ['value']
window = '1H'
gb_rolling = df.groupby(by=by).rolling(window=window)
agg_d = {'type': 'count'}
test = gb_rolling.agg(agg_d)
print(test)
# this works
                           type
value time_ref
1     2023-01-01 00:30:00   1.0
      2023-01-01 01:00:00   2.0
2     2023-01-01 00:30:00   1.0
10    2023-01-01 02:00:00   1.0
      2023-01-01 02:15:00   2.0
      2023-01-01 02:16:00   3.0
      2023-01-01 04:00:00   1.0
# but this doesn't
agg_d = {'type': 'size'}
test = gb_rolling.agg(agg_d)
# AttributeError: 'size' is not a valid function for 'RollingGroupby' object
My desired output is the SIZE of the group for every row, i.e. this:
                           type
value time_ref
1     2023-01-01 00:30:00     2
      2023-01-01 01:00:00     2
2     2023-01-01 00:30:00     1
10    2023-01-01 02:00:00     3
      2023-01-01 02:15:00     3
      2023-01-01 02:16:00     3
      2023-01-01 04:00:00     1
I cannot think of a way to do what I need without the rolling functionality, because the relevant windows of my data are not determined by calendar time but by the times of the events themselves. If that assumption is wrong, and I can get a 'size' without using rolling, that is OK too; but as far as I know I have to use rolling, since the event's time_ref relative to subsequent rows is what matters for grouping, not pure calendar time.
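For contrast, a pure calendar-time grouping (which is not what I want, since it bins on clock hours rather than anchoring windows on the events) would be something like:
# calendar-hour bins on the DatetimeIndex, not event-anchored windows
df.groupby(['value', pd.Grouper(freq='1H')]).size()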
Thanks.
CodePudding user response:
I'm not completely following your question. It seems like you want the type column to be the number of rows of a given value for each 1-hour increment... but if that's the case, your desired output is incorrect and should be:
                           type
value time_ref
1     2023-01-01 00:30:00     1   # <- not 2 here (1 in 0-hr, 1 in 1-hr window)
      2023-01-01 01:00:00     1   # <- same here
2     2023-01-01 00:30:00     1   # rest is ok...
...
If that's correct, then, starting with:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
...just add an hour column:
df['hour'] = df.time_ref.dt.hour
and aggregate on that and value:
tmp = (
    df.groupby(['value', 'hour'])
    .agg('count')
    .reset_index()
    .rename(columns={'time_ref': 'type'}))
which gives you:
   value  hour  type
0      1     0     1
1      1     1     1
2      2     0     1
3     10     2     3
4     10     4     1
...which you can join back onto your original df:
res = df.merge(tmp, how='left', on=['value', 'hour'])
             time_ref  value  hour  type
0 2023-01-01 00:30:00      1     0     1
1 2023-01-01 00:30:00      2     0     1
2 2023-01-01 01:00:00      1     1     1
3 2023-01-01 02:00:00     10     2     3
4 2023-01-01 02:15:00     10     2     3
5 2023-01-01 02:16:00     10     2     3
6 2023-01-01 04:00:00     10     4     1
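The merge can also be collapsed with transform, which broadcasts the per-group count straight back onto the rows; a sketch of the same idea:
# same result without building tmp and merging
df['type'] = df.groupby(['value', 'hour'])['time_ref'].transform('count')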
If that's not what you're looking for, please clarify your question.
CodePudding user response:
Ah... thanks for clarifying. I understand the problem now. I played around with rolling but couldn't find a way to get it to work either; here is an alternate method, though:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
df.index = df.time_ref
value_start = df.groupby('value').agg('min')  # first time_ref per value
df['hrs_since_group_start'] = df.apply(
    lambda row: row.time_ref - value_start.loc[row.value, 'time_ref'],
    axis=1
).view(int) / 1_000_000_000 / 60 // 60
(.view(int) changes the timedelta to nanoseconds, so the / 1_000_000_000 / 60 changes it to minutes since the group's first event, and // 60 changes it to the number of whole hours since the group's first event.)
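An equivalent spelling that skips both the apply and the nanosecond view, assuming time_ref is a datetime column, uses transform and dt.total_seconds():
# vectorized: whole hours since each group's first event
df['hrs_since_group_start'] = (
    (df.time_ref - df.groupby('value')['time_ref'].transform('min'))
    .dt.total_seconds() // 3600)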
group_hourly_counts = (
    df.groupby(['value', 'hrs_since_group_start'])
    .agg('count')
    .reset_index()
    .rename(columns={'time_ref': 'type'}))
res = (
    df.merge(
        group_hourly_counts,
        how='left',
        on=['value', 'hrs_since_group_start'])
    .drop(columns='hrs_since_group_start'))
res:
             time_ref  value  type
0 2023-01-01 00:30:00      1     2
1 2023-01-01 00:30:00      2     1
2 2023-01-01 01:00:00      1     2
3 2023-01-01 02:00:00     10     3
4 2023-01-01 02:15:00     10     3
5 2023-01-01 02:16:00     10     3
6 2023-01-01 04:00:00     10     1
...somebody more familiar with the rolling functionality can probably find you a simpler solution though :)