I have a DataFrame containing timestamps and values.
import numpy as np
import pandas as pd

# Named `timestamps` rather than `list`, to avoid shadowing the built-in.
timestamps = ['2020-04-22 13:29:00+00:00','2020-04-22 13:31:00+00:00','2020-04-22 13:32:00+00:00','2020-04-22 13:33:00+00:00','2020-04-22 13:34:00+00:00','2020-04-22 13:35:00+00:00','2020-04-22 13:36:00+00:00','2020-04-22 13:54:00+00:00','2020-04-22 13:55:00+00:00','2020-04-22 13:56:00+00:00','2020-04-22 13:57:00+00:00','2020-04-22 13:58:00+00:00','2020-04-22 13:59:00+00:00','2020-04-22 14:00:00+00:00','2020-04-22 14:01:00+00:00','2020-04-22 14:02:00+00:00','2020-04-22 14:03:00+00:00','2020-04-22 14:04:00+00:00','2020-04-22 14:05:00+00:00','2020-04-22 14:06:00+00:00','2020-04-22 14:49:00+00:00','2020-04-22 14:50:00+00:00','2020-04-22 14:51:00+00:00','2020-04-22 14:52:00+00:00','2020-04-22 14:53:00+00:00','2020-04-22 14:54:00+00:00','2020-04-22 14:55:00+00:00','2020-04-22 14:56:00+00:00','2020-04-22 14:57:00+00:00','2020-04-22 14:58:00+00:00','2020-04-22 14:59:00+00:00','2020-04-22 15:00:00+00:00','2020-04-22 15:01:00+00:00','2020-04-22 15:02:00+00:00','2020-04-22 15:24:00+00:00','2020-04-22 15:25:00+00:00','2020-04-22 15:26:00+00:00','2020-04-22 15:27:00+00:00','2020-04-22 15:28:00+00:00','2020-04-22 15:29:00+00:00','2020-04-22 15:36:00+00:00']
df = pd.DataFrame(timestamps)
df.columns = ['timestamp']
df['rand'] = np.random.randint(0, 100, df.shape[0])
df.timestamp = pd.to_datetime(df.timestamp)
df.set_index('timestamp', inplace=True)
I want to resample my data: group the timestamps by hour and sum the values.
It can easily be done via the Pandas function resample:
agg_dict = {'rand': 'sum'}
dfr = df.resample('1H').agg(agg_dict)
The problem is that this method loses the original start and end times.
If my first hour started at 13:29, the first aggregated hour will appear as 13:00 to 14:00, which is "wrong" because there was no record between 13:00 and 13:28.
Is there an existing function grouping by hours but respecting start and end time, and if possible, displaying the start and end time of the intervals?
Displaying the start and end times would also avoid confusion about the last aggregation.
If my last hour ends at 15:36, the aggregated hour will appear as 15:00 and if the end time is not specified, one can assume that it means 15:00 to 15:59, which is "wrong" as there is no data beyond 15:36.
Before coding something myself, I wanted to check with the community whether something exists out of the box. I tried the options of Pandas resampling and failed to find a similar function.
Thanks for any help
CodePudding user response:
You can aggregate the timestamp in each hourly interval:
agg_dict = {
"from": ("ts", "min"),
"to": ("ts", "max"),
"sum": ("rand", "sum"),
}
df.assign(ts=df.index).resample("1H").agg(**agg_dict)
This uses named aggregation. For example, "from": ("ts", "min") means: take column ts, apply min, and call the resulting output column from. The other entries work the same way.
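For reference, here is a small self-contained sketch of the same idea with fixed values instead of random ones, so the result can be checked by hand. It is an illustration rather than the answer's exact code: the five sample rows are made up, and it groups with groupby and pd.Grouper(freq="1h") instead of resample, on the assumption that both produce the same hourly bins for this data.

```python
import pandas as pd

# Hypothetical sample data: five records spread over three clock hours.
idx = pd.to_datetime(
    [
        "2020-04-22 13:29:00+00:00",
        "2020-04-22 13:31:00+00:00",
        "2020-04-22 14:00:00+00:00",
        "2020-04-22 14:06:00+00:00",
        "2020-04-22 15:36:00+00:00",
    ],
    utc=True,
)
df = pd.DataFrame({"rand": [1, 2, 3, 4, 5]}, index=idx.rename("timestamp"))

# Named aggregation: output column -> (input column, aggregation function).
# "from" is a Python keyword, so the mapping has to be unpacked with **.
named = {
    "from": ("ts", "min"),  # first timestamp actually observed in the bin
    "to": ("ts", "max"),    # last timestamp actually observed in the bin
    "sum": ("rand", "sum"),
}

# Copy the index into a regular column so it can be aggregated like any other.
out = df.assign(ts=df.index).groupby(pd.Grouper(freq="1h")).agg(**named)
print(out)
```

The index of out still carries the clock-hour labels (13:00, 14:00, 15:00), while from and to hold the first and last timestamps that were actually recorded in each hour.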
CodePudding user response:
Maybe you are looking for:
agg_dict = {'rand': 'sum', 'timestamp': ['min', 'max']}
out = df.reset_index().resample('H', on='timestamp').agg(agg_dict)
print(out)
# Output
                          rand                 timestamp
                           sum                       min                       max
timestamp
2020-04-22 13:00:00+00:00  440 2020-04-22 13:29:00+00:00 2020-04-22 13:59:00+00:00
2020-04-22 14:00:00+00:00  982 2020-04-22 14:00:00+00:00 2020-04-22 14:59:00+00:00
2020-04-22 15:00:00+00:00  502 2020-04-22 15:00:00+00:00 2020-04-22 15:36:00+00:00