Get DataFrame with the number of rows for each time interval-CodePudding

Given the following DataFrame of pandas in Python:

|      ID      |     date                              |
|--------------|---------------------------------------|
|     2        |        2022-03-02 07:24:19 01:00      |
|     2        |        2022-03-02 07:24:19 01:00      |
|     0        |        2022-03-02 08:00:00 01:00      |
|     0        |        2022-03-02 08:08:30 01:00      |
|     1        |        2022-03-02 09:11:50 01:00      |
|     1        |        2022-03-02 10:19:11 01:00      |
|     1        |        2022-03-02 10:12:11 01:00      |
|     3        |        2022-03-03 08:33:22 01:00      |
|     3        |        2022-03-03 09:23:22 01:00      |
|     3        |        2022-03-03 12:13:22 01:00      |
|     3        |        2022-03-03 12:35:22 01:00      |

I need to create a new DataFrame containing the total number of rows for each day in a given time interval, specified by parameter. Let's assume 1 hour for this example. Example of the DataFrame I want to obtain:

|      date             |  start_interval   | end_interval      | total_rows |
|-----------------------|-------------------|-------------------|------------|
|     2022-03-02        |        00:00:00   | 01:00:00          |     0      |
|     2022-03-02        |        01:00:00   | 02:00:00          |     0      |
|     2022-03-02        |        02:00:00   | 03:00:00          |     0      |
|     2022-03-02        |        03:00:00   | 04:00:00          |     0      |
|     2022-03-02        |        04:00:00   | 05:00:00          |     0      |
|     2022-03-02        |        05:00:00   | 06:00:00          |     0      |
|     2022-03-02        |        06:00:00   | 07:00:00          |     0      |
|     2022-03-02        |        07:00:00   | 08:00:00          |     2      |
|     2022-03-02        |        08:00:00   | 09:00:00          |     2      |
|     2022-03-02        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-02        |        10:00:00   | 11:00:00          |     2      |
|     2022-03-02        |        11:00:00   | 12:00:00          |     0      |
|     2022-03-02        |        12:00:00   | 13:00:00          |     0      |
|     2022-03-02        |        13:00:00   | 14:00:00          |     0      |
|     2022-03-02        |        14:00:00   | 15:00:00          |     0      |
|     2022-03-02        |        15:00:00   | 16:00:00          |     0      |
|     2022-03-02        |        16:00:00   | 17:00:00          |     0      |
|     2022-03-02        |        17:00:00   | 18:00:00          |     0      |
|     2022-03-02        |        18:00:00   | 19:00:00          |     0      |
|     2022-03-02        |        19:00:00   | 20:00:00          |     0      |
|     2022-03-02        |        20:00:00   | 21:00:00          |     0      |
|     2022-03-02        |        21:00:00   | 22:00:00          |     0      |
|     2022-03-02        |        22:00:00   | 23:00:00          |     0      |
|     2022-03-02        |        23:00:00   | 00:00:00          |     0      |
|     2022-03-03        |        00:00:00   | 01:00:00          |     0      |
|     2022-03-03        |        01:00:00   | 02:00:00          |     0      |
|     2022-03-03        |        02:00:00   | 03:00:00          |     0      |
|     2022-03-03        |        03:00:00   | 04:00:00          |     0      |
|     2022-03-03        |        04:00:00   | 05:00:00          |     0      |
|     2022-03-03        |        05:00:00   | 06:00:00          |     0      |
|     2022-03-03        |        06:00:00   | 07:00:00          |     0      |
|     2022-03-03        |        07:00:00   | 08:00:00          |     0      |
|     2022-03-03        |        08:00:00   | 09:00:00          |     1      |
|     2022-03-03        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-03        |        10:00:00   | 11:00:00          |     0      |
|     2022-03-03        |        11:00:00   | 12:00:00          |     0      |
|     2022-03-03        |        12:00:00   | 13:00:00          |     2      |
|     2022-03-03        |        13:00:00   | 14:00:00          |     0      |
|     2022-03-03        |        14:00:00   | 15:00:00          |     0      |
|     2022-03-03        |        15:00:00   | 16:00:00          |     0      |
|     2022-03-03        |        16:00:00   | 17:00:00          |     0      |
|     2022-03-03        |        17:00:00   | 18:00:00          |     0      |
|     2022-03-03        |        18:00:00   | 19:00:00          |     0      |
|     2022-03-03        |        19:00:00   | 20:00:00          |     0      |
|     2022-03-03        |        20:00:00   | 21:00:00          |     0      |
|     2022-03-03        |        21:00:00   | 22:00:00          |     0      |
|     2022-03-03        |        22:00:00   | 23:00:00          |     0      |
|     2022-03-03        |        23:00:00   | 00:00:00          |     0      |

My idea is to finally delete all rows containing a 0 in the total_rows column.

df= df[df['total_rows'] != 0]

|      date             |  start_interval   | end_interval      | total_rows |
|-----------------------|-------------------|-------------------|------------|
|     2022-03-02        |        07:00:00   | 08:00:00          |     2      |
|     2022-03-02        |        08:00:00   | 09:00:00          |     2      |
|     2022-03-02        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-02        |        10:00:00   | 11:00:00          |     2      |
|     2022-03-03        |        08:00:00   | 09:00:00          |     1      |
|     2022-03-03        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-03        |        12:00:00   | 13:00:00          |     2      |

How could I get this result. Thank you for your help in advance.

CodePudding user response：

Floor your date column then count number of occurrences:

s = df['date'].groupby(df['date'].dt.floor('H')).count()

out = pd.DataFrame({'date': s.index.date, 'start_interval': s.index.time, 
                    'end_interval': (s.index   pd.DateOffset(hours=1)).time,
                    'total_rows': s.to_numpy()})
print(out)

# Output
         date start_interval end_interval  total_rows
0  2022-03-02       07:00:00     08:00:00           2
1  2022-03-02       08:00:00     09:00:00           2
2  2022-03-02       09:00:00     10:00:00           1
3  2022-03-02       10:00:00     11:00:00           2
4  2022-03-03       08:00:00     09:00:00           1
5  2022-03-03       09:00:00     10:00:00           1
6  2022-03-03       12:00:00     13:00:00           2

CodePudding user response：

That's a nice job for pd.Grouper:

z = df.groupby(
    pd.Grouper(freq='1h', key='date')
).size().to_frame('total_rows').reset_index()

out = z.assign(
    start_interval=z['date'].dt.time,
    end_interval=(z['date']   pd.Timedelta(1, 'hour')).dt.time,
    date=z['date'].dt.normalize(),
)