Check for pauses in pandas df-CodePudding

I have a following data frame:

import pandas as pd

dict_df = {
    "id": [1, 2, 2, 2, 3, 3, 3, 3],
    "time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 10:00:27",
        "2022-08-30 13:00:45",
        "2022-08-30 13:02:27",
    ],
}

df = pd.DataFrame(dict_df)

I would like to get a new df with columns start_time and end_time based on time in the df above. For each employee I would like to check if the pause between time is larger than 10 minutes. If yes I would like to consider it as a break and record it in a new row.

Desired output is:

new_dict = {
    "id": [1, 2, 2, 3, 3],
    "start_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 13:00:45",
    ],

    "end_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:27",
        "2022-08-30 13:02:27",
    ],
}

new_df = pd.DataFrame(new_dict)

How can I do it, please?

CodePudding user response：

Create groups by custom function in GroupBy.transform with compare differencies with cumulative sum and then aggregate GroupBy.first and GroupBy.last:

df['time'] = pd.to_datetime(df['time'])

minutes = 10
f = lambda x: x.diff().dt.total_seconds().gt(minutes * 60).cumsum()
df['g'] = df.groupby('id')['time'].transform(f)

df = (df.groupby(['id','g'])
        .agg(start_time=('time','first'),end_time=('time','last'))
        .droplevel(1)
        .reset_index())
print (df)
   id          start_time            end_time
0   1 2022-08-30 08:00:02 2022-08-30 08:00:02
1   2 2022-08-30 08:03:07 2022-08-30 08:05:12
2   2 2022-08-30 12:06:52 2022-08-30 12:06:52
3   3 2022-08-30 10:00:02 2022-08-30 10:00:27
4   3 2022-08-30 13:00:45 2022-08-30 13:02:27