There is a dataframe with hourly data, e.g.:
DATE TIME Amount
2022-11-07 21:00:00 10
2022-11-07 22:00:00 11
2022-11-08 07:00:00 10
2022-11-08 08:00:00 13
2022-11-08 09:00:00 12
2022-11-08 10:00:00 11
2022-11-08 11:00:00 13
2022-11-08 12:00:00 12
2022-11-08 13:00:00 10
2022-11-08 14:00:00 9
...
I would like to add a new column sum_morning where I calculate the sum of "Amount" for the morning hours only (07:00 - 12:00):
DATE TIME Amount sum_morning
2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 12 71
2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 9 NaN
...
There can be gaps in the dataframe (e.g. from 22:00 - 07:00), so shift is probably not working here.
I thought about
- creating a new dataframe where I filter all time slices from 07:00 - 12:00 for all dates
- do a group by and calculate the sum for each day
- and then merge this back to the original df.
But maybe there is a more effective solution?
I really enjoy working with Python / pandas, but hourly data still makes my head spin.
CodePudding user response:
First set a DatetimeIndex
in order to use DataFrame.between_time
, then groupby DATE and aggregate by sum
. Finally, get the last value of datetimes per day, in order to match the index of the original DataFrame:
df.index = pd.to_datetime(df['DATE'] ' ' df['TIME'])
s = (df.between_time('7:00','12:00')
.reset_index()
.groupby('DATE')
.agg({'Amount':'sum', 'index':'last'})
.set_index('index')['Amount'])
df['sum_morning'] = s
print (df)
DATE TIME Amount sum_morning
2022-11-07 21:00:00 2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 2022-11-08 12:00:00 12 71.0
2022-11-08 13:00:00 2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 2022-11-08 14:00:00 9 NaN
Lastly, if you need to remove DatetimeIndex
you can use:
df = df.reset_index(drop=True)
CodePudding user response:
You can use:
# get values between 7 and 12h
m = pd.to_timedelta(df['TIME']).between('7h', '12h')
# find last True per day
idx = m&m.groupby(df['DATE']).shift(-1).ne(True)
# assign the sum of the 7-12h values on the last True per day
df.loc[idx, 'sum_morning'] = df['Amount'].where(m).groupby(df['DATE']).transform('sum')
Output:
DATE TIME Amount sum_morning
0 2022-11-07 21:00:00 10 NaN
1 2022-11-07 22:00:00 11 NaN
2 2022-11-08 07:00:00 10 NaN
3 2022-11-08 08:00:00 13 NaN
4 2022-11-08 09:00:00 12 NaN
5 2022-11-08 10:00:00 11 NaN
6 2022-11-08 11:00:00 13 NaN
7 2022-11-08 12:00:00 12 71.0
8 2022-11-08 13:00:00 10 NaN
9 2022-11-08 14:00:00 9 NaN