Given the following DataFrame of pandas in Python:
| ID | date |
|--------------|---------------------------------------|
| 2 | 2022-03-02 07:24:19 01:00 |
| 2 | 2022-03-02 07:24:19 01:00 |
| 0 | 2022-03-02 08:00:00 01:00 |
| 0 | 2022-03-02 08:08:30 01:00 |
| 1 | 2022-03-02 09:11:50 01:00 |
| 1 | 2022-03-02 10:19:11 01:00 |
| 1 | 2022-03-02 10:12:11 01:00 |
| 3 | 2022-03-03 08:33:22 01:00 |
| 3 | 2022-03-03 09:23:22 01:00 |
| 3 | 2022-03-03 12:13:22 01:00 |
| 3 | 2022-03-03 12:35:22 01:00 |
I need to create a new DataFrame containing the total number of rows for each day in a given time interval, specified by parameter. Let's assume 1 hour for this example. Example of the DataFrame I want to obtain:
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-02 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-02 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-02 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-02 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-02 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-02 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-02 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-02 | 12:00:00 | 13:00:00 | 0 |
| 2022-03-02 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-02 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-02 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-02 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-02 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-02 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-02 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-02 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-02 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-02 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-02 | 23:00:00 | 00:00:00 | 0 |
| 2022-03-03 | 00:00:00 | 01:00:00 | 0 |
| 2022-03-03 | 01:00:00 | 02:00:00 | 0 |
| 2022-03-03 | 02:00:00 | 03:00:00 | 0 |
| 2022-03-03 | 03:00:00 | 04:00:00 | 0 |
| 2022-03-03 | 04:00:00 | 05:00:00 | 0 |
| 2022-03-03 | 05:00:00 | 06:00:00 | 0 |
| 2022-03-03 | 06:00:00 | 07:00:00 | 0 |
| 2022-03-03 | 07:00:00 | 08:00:00 | 0 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 10:00:00 | 11:00:00 | 0 |
| 2022-03-03 | 11:00:00 | 12:00:00 | 0 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
| 2022-03-03 | 13:00:00 | 14:00:00 | 0 |
| 2022-03-03 | 14:00:00 | 15:00:00 | 0 |
| 2022-03-03 | 15:00:00 | 16:00:00 | 0 |
| 2022-03-03 | 16:00:00 | 17:00:00 | 0 |
| 2022-03-03 | 17:00:00 | 18:00:00 | 0 |
| 2022-03-03 | 18:00:00 | 19:00:00 | 0 |
| 2022-03-03 | 19:00:00 | 20:00:00 | 0 |
| 2022-03-03 | 20:00:00 | 21:00:00 | 0 |
| 2022-03-03 | 21:00:00 | 22:00:00 | 0 |
| 2022-03-03 | 22:00:00 | 23:00:00 | 0 |
| 2022-03-03 | 23:00:00 | 00:00:00 | 0 |
My idea is to finally delete all rows containing a 0 in the total_rows column.
df= df[df['total_rows'] != 0]
| date | start_interval | end_interval | total_rows |
|-----------------------|-------------------|-------------------|------------|
| 2022-03-02 | 07:00:00 | 08:00:00 | 2 |
| 2022-03-02 | 08:00:00 | 09:00:00 | 2 |
| 2022-03-02 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-02 | 10:00:00 | 11:00:00 | 2 |
| 2022-03-03 | 08:00:00 | 09:00:00 | 1 |
| 2022-03-03 | 09:00:00 | 10:00:00 | 1 |
| 2022-03-03 | 12:00:00 | 13:00:00 | 2 |
How could I get this result. Thank you for your help in advance.
CodePudding user response:
Floor your date
column then count number of occurrences:
s = df['date'].groupby(df['date'].dt.floor('H')).count()
out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time,
'end_interval': (s.index pd.DateOffset(hours=1)).time,
'total_rows': s.to_numpy()})
print(out)
# Output
date start_interval end_interval total_rows
0 2022-03-02 07:00:00 08:00:00 2
1 2022-03-02 08:00:00 09:00:00 2
2 2022-03-02 09:00:00 10:00:00 1
3 2022-03-02 10:00:00 11:00:00 2
4 2022-03-03 08:00:00 09:00:00 1
5 2022-03-03 09:00:00 10:00:00 1
6 2022-03-03 12:00:00 13:00:00 2
CodePudding user response:
That's a nice job for pd.Grouper
:
z = df.groupby(
pd.Grouper(freq='1h', key='date')
).size().to_frame('total_rows').reset_index()
out = z.assign(
start_interval=z['date'].dt.time,
end_interval=(z['date'] pd.Timedelta(1, 'hour')).dt.time,
date=z['date'].dt.normalize(),
)