Python dataframe - resample timestamps, group by hour, but keep the start and end datetime

Time:01-27

I have a DataFrame containing timestamps and values.

import numpy as np
import pandas as pd

# Minute-level UTC timestamps with gaps, spanning 13:29 to 15:36
timestamps = ['2020-04-22 13:29:00+00:00','2020-04-22 13:31:00+00:00','2020-04-22 13:32:00+00:00','2020-04-22 13:33:00+00:00','2020-04-22 13:34:00+00:00','2020-04-22 13:35:00+00:00','2020-04-22 13:36:00+00:00','2020-04-22 13:54:00+00:00','2020-04-22 13:55:00+00:00','2020-04-22 13:56:00+00:00','2020-04-22 13:57:00+00:00','2020-04-22 13:58:00+00:00','2020-04-22 13:59:00+00:00','2020-04-22 14:00:00+00:00','2020-04-22 14:01:00+00:00','2020-04-22 14:02:00+00:00','2020-04-22 14:03:00+00:00','2020-04-22 14:04:00+00:00','2020-04-22 14:05:00+00:00','2020-04-22 14:06:00+00:00','2020-04-22 14:49:00+00:00','2020-04-22 14:50:00+00:00','2020-04-22 14:51:00+00:00','2020-04-22 14:52:00+00:00','2020-04-22 14:53:00+00:00','2020-04-22 14:54:00+00:00','2020-04-22 14:55:00+00:00','2020-04-22 14:56:00+00:00','2020-04-22 14:57:00+00:00','2020-04-22 14:58:00+00:00','2020-04-22 14:59:00+00:00','2020-04-22 15:00:00+00:00','2020-04-22 15:01:00+00:00','2020-04-22 15:02:00+00:00','2020-04-22 15:24:00+00:00','2020-04-22 15:25:00+00:00','2020-04-22 15:26:00+00:00','2020-04-22 15:27:00+00:00','2020-04-22 15:28:00+00:00','2020-04-22 15:29:00+00:00','2020-04-22 15:36:00+00:00']
df = pd.DataFrame(timestamps, columns=['timestamp'])
df['rand'] = np.random.randint(0, 100, df.shape[0])  # random values to aggregate
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

I want to resample my data: group the timestamps by hour and sum the values.

It can easily be done via the Pandas function resample:

agg_dict = {'rand': 'sum'}
dfr = df.resample('1H').agg(agg_dict)

The problem is that this method loses the original start and end times.

If my first hour started at 13:29, the first aggregated hour will appear as 13:00 to 14:00, which is "wrong" because there was no record between 13:00 and 13:28.
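To make the issue concrete, here is a minimal sketch with made-up values (the `demo` frame and its three readings are assumptions for illustration, not the data above). It uses the lowercase "h" alias, which recent pandas versions prefer over "H".

```python
import pandas as pd

# Made-up data: three readings, the first at 13:29 and the last at 14:05 (UTC).
idx = pd.to_datetime([
    "2020-04-22 13:29:00+00:00",
    "2020-04-22 13:31:00+00:00",
    "2020-04-22 14:05:00+00:00",
])
demo = pd.DataFrame({"rand": [1, 2, 3]}, index=idx)

hourly = demo.resample("h").sum()

# Each row is labelled with the left edge of its hourly bin,
# so the first label is 13:00 even though the first reading is at 13:29.
print(hourly.index[0])  # bin label
print(demo.index[0])    # actual first timestamp
```

The bin label alone says nothing about where the data inside the bin actually begins or ends.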

Is there an existing function grouping by hours but respecting start and end time, and if possible, displaying the start and end time of the intervals?

Displaying the start and end times would also avoid confusion about the last aggregation.

If my last hour ends at 15:36, the aggregated hour appears as 15:00. Without an explicit end time, a reader would assume it covers 15:00 to 15:59, which is "wrong" because there is no data beyond 15:36.

Before coding something myself, I wanted to check with the community whether something exists out of the box. I tried the options of Pandas resampling and failed to find a suitable function.

Thanks for any help

CodePudding user response:

You can aggregate the timestamp in each hourly interval:

agg_dict = {
    "from": ("ts", "min"),
    "to": ("ts", "max"),
    "sum": ("rand", "sum"),
}
df.assign(ts=df.index).resample("1H").agg(**agg_dict)

This uses named aggregation. Each entry means: take a column (ts), apply a function (min), and name the resulting output column (from), and so on for the other entries.
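As a self-contained illustration of the same idea, here is a sketch on made-up values. Note that it swaps resample for groupby with pd.Grouper, which groups on the index in the same hourly bins. The `demo` frame and the output names `start`, `end`, and `total` are assumptions chosen for this example.

```python
import pandas as pd

# Made-up values; only the shape of the data mirrors the question.
idx = pd.to_datetime([
    "2020-04-22 13:29:00+00:00",
    "2020-04-22 13:59:00+00:00",
    "2020-04-22 14:00:00+00:00",
    "2020-04-22 14:59:00+00:00",
    "2020-04-22 15:36:00+00:00",
])
demo = pd.DataFrame({"rand": [10, 20, 30, 40, 50]}, index=idx)

# Copy the index into a regular column so it can be aggregated,
# then group the rows into hourly bins on the index.
summary = (
    demo.assign(ts=demo.index)
    .groupby(pd.Grouper(freq="h"))
    .agg(start=("ts", "min"), end=("ts", "max"), total=("rand", "sum"))
)
print(summary)
```

Each row keeps the hourly bin label as its index, while `start` and `end` hold the first and last timestamps actually observed in that bin.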

CodePudding user response:

Maybe you are looking for:

agg_dict = {'rand': 'sum', 'timestamp': ['min', 'max']}
out = df.reset_index().resample('H', on='timestamp').agg(agg_dict)
print(out)

# Output
                          rand                 timestamp                          
                           sum                       min                       max
timestamp                                                                         
2020-04-22 13:00:00+00:00  440 2020-04-22 13:29:00+00:00 2020-04-22 13:59:00+00:00
2020-04-22 14:00:00+00:00  982 2020-04-22 14:00:00+00:00 2020-04-22 14:59:00+00:00
2020-04-22 15:00:00+00:00  502 2020-04-22 15:00:00+00:00 2020-04-22 15:36:00+00:00
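A dict that mixes a single function with a list of functions produces two-level column headers, as in the output above. If flat names are easier to work with, one option is to join the two levels. This is a sketch on made-up values; the `demo` frame, the copied `ts` column, and the underscore naming are assumptions for illustration.

```python
import pandas as pd

# Made-up values for illustration.
idx = pd.to_datetime([
    "2020-04-22 13:29:00+00:00",
    "2020-04-22 13:59:00+00:00",
    "2020-04-22 14:00:00+00:00",
])
demo = pd.DataFrame({"rand": [10, 20, 30]}, index=idx)

# Aggregating with a dict that contains a list yields MultiIndex columns:
# ('rand', 'sum'), ('ts', 'min'), ('ts', 'max').
out = demo.assign(ts=demo.index).resample("h").agg(
    {"rand": "sum", "ts": ["min", "max"]}
)

# Join the two header levels into single names such as 'ts_min'.
out.columns = ["_".join(col) for col in out.columns]
print(out)
```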