Create a DataFrame with the total number of rows for each time interval grouped by ID-CodePudding

Given the following DataFrame of pandas in Python:

|      ID      |     date                              |
|--------------|---------------------------------------|
|     ESP      |        2022-03-02 07:24:19 01:00      |
|     ESP      |        2022-03-02 07:24:19 01:00      |
|     ESP      |        2022-03-02 08:00:00 01:00      |
|     UK       |        2022-03-02 08:08:30 01:00      |
|     ESP      |        2022-03-02 09:11:50 01:00      |
|     USA      |        2022-03-02 10:19:11 01:00      |
|     UK       |        2022-03-02 10:12:11 01:00      |
|     USA      |        2022-03-03 08:33:22 01:00      |
|     USA      |        2022-03-03 09:23:22 01:00      |
|     UK       |        2022-03-03 12:13:22 01:00      |
|     UK       |        2022-03-03 12:35:22 01:00      |

With the following code implemented in Python, I get the following DataFrame:

def create_dataframe(df):
    df['date'] = pd.to_datetime(df['date'].astype(str).str.split(' ').str[0])

    string = df['date'].groupby(df['date'].dt.floor('H')).count()

    df = pd.DataFrame({'date': string.index.date, 'start_interval': string.index.time, 
                    'end_interval': (string.index   pd.DateOffset(hours=1)).time,
                    'total_rows': string.to_numpy()})
    
    return df

|      date             |  start_interval   | end_interval      | total_rows |
|-----------------------|-------------------|-------------------|------------|
|     2022-03-02        |        07:00:00   | 08:00:00          |     2      |
|     2022-03-02        |        08:00:00   | 09:00:00          |     2      |
|     2022-03-02        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-02        |        10:00:00   | 11:00:00          |     2      |
|     2022-03-03        |        08:00:00   | 09:00:00          |     1      |
|     2022-03-03        |        09:00:00   | 10:00:00          |     1      |
|     2022-03-03        |        12:00:00   | 13:00:00          |     2      |

I would like to add to the table the information provided by the 'ID' column, i.e. get this DataFrame:

|   ID   |      date             |  start_interval   | end_interval      | total_rows |
|--------|-----------------------|-------------------|-------------------|------------|
|   ESP  |     2022-03-02        |        07:00:00   | 08:00:00          |     2      |
|   ESP  |     2022-03-02        |        08:00:00   | 09:00:00          |     1      |
|   UK   |     2022-03-02        |        08:00:00   | 09:00:00          |     1      |
|   ESP  |     2022-03-02        |        09:00:00   | 10:00:00          |     1      |
|   USA  |     2022-03-02        |        10:00:00   | 11:00:00          |     1      |
|   UK   |     2022-03-02        |        10:00:00   | 11:00:00          |     1      |
|   USA  |     2022-03-03        |        08:00:00   | 09:00:00          |     1      |
|   USA  |     2022-03-03        |        09:00:00   | 10:00:00          |     1      |
|   UK   |     2022-03-03        |        12:00:00   | 13:00:00          |     2      |

How could I modify the supplied code to obtain the resulting table? Thank you in advance for your help.

CodePudding user response：

Does this produce what you are looking for:

result = (
    df
    .groupby(['ID', df['date'].dt.floor('H')]).agg(total_rows=('date', 'count'))
    .reset_index()
    .assign(
        start_interval=lambda df: df['date'].dt.time,
        end_interval=lambda df: (df['date']   pd.Timedelta(hours=1)).dt.time,
        date=lambda df: df['date'].dt.date
    )
)

Result:

    ID        date  total_rows start_interval end_interval
0  ESP  2022-03-02           2       07:00:00     08:00:00
1  ESP  2022-03-02           1       08:00:00     09:00:00
2  ESP  2022-03-02           1       09:00:00     10:00:00
3   UK  2022-03-02           1       08:00:00     09:00:00
4   UK  2022-03-02           1       10:00:00     11:00:00
5   UK  2022-03-03           2       12:00:00     13:00:00
6  USA  2022-03-02           1       10:00:00     11:00:00
7  USA  2022-03-03           1       08:00:00     09:00:00
8  USA  2022-03-03           1       09:00:00     10:00:00