I have the following data frame:
import pandas as pd
dict_df = {
    "id": [1, 2, 2, 2, 3, 3, 3, 3],
    "time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 10:00:27",
        "2022-08-30 13:00:45",
        "2022-08-30 13:02:27",
    ],
}
df = pd.DataFrame(dict_df)
I would like to get a new df with columns start_time and end_time, based on time in the df above. For each employee (id) I would like to check whether the pause between consecutive time values is larger than 10 minutes. If it is, I would like to treat it as a break and start a new row.
The desired output is:
new_dict = {
    "id": [1, 2, 2, 3, 3],
    "start_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 13:00:45",
    ],
    "end_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:27",
        "2022-08-30 13:02:27",
    ],
}
new_df = pd.DataFrame(new_dict)
How can I do it, please?
CodePudding user response:
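Create a block number within each id with a custom function in GroupBy.transform: compare the differences between consecutive times with the threshold and take the cumulative sum, so the counter increases each time a gap exceeds 10 minutes. Then aggregate each block with GroupBy.first and GroupBy.last: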
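# Parse the strings so that time differences can be computed
df['time'] = pd.to_datetime(df['time'])
minutes = 10
# Flag gaps longer than the threshold; the cumulative sum numbers the blocks
f = lambda x: x.diff().dt.total_seconds().gt(minutes * 60).cumsum()
df['g'] = df.groupby('id')['time'].transform(f)
# First and last timestamp of each (id, block) pair
df = (df.groupby(['id','g'])
        .agg(start_time=('time','first'), end_time=('time','last'))
        .droplevel(1)
        .reset_index())
print(df)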
   id          start_time            end_time
0   1 2022-08-30 08:00:02 2022-08-30 08:00:02
1   2 2022-08-30 08:03:07 2022-08-30 08:05:12
2   2 2022-08-30 12:06:52 2022-08-30 12:06:52
3   3 2022-08-30 10:00:02 2022-08-30 10:00:27
4   3 2022-08-30 13:00:45 2022-08-30 13:02:27
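If you need this in more than one place, the same idea can be wrapped in a function that leaves the input frame untouched. This is only a sketch of the approach above: the name split_on_breaks and the sample frame are made up for illustration, and the explicit sort is an added assumption in case the rows are not already ordered by time within each id.

```python
import pandas as pd


def split_on_breaks(df, minutes=10):
    """Hypothetical helper: one row per uninterrupted block of activity per id."""
    out = df.copy()  # do not modify the caller's frame
    out["time"] = pd.to_datetime(out["time"])
    # Assumption: sort so that diff() compares consecutive events of the same id
    out = out.sort_values(["id", "time"])
    # Gap to the previous event of the same id (NaT for the first event)
    gap = out.groupby("id")["time"].diff()
    # 1 where a gap exceeds the threshold, else 0 (NaT compares as False)
    new_block = (gap > pd.Timedelta(minutes=minutes)).astype(int)
    # Running count of breaks per id gives a block number
    out["g"] = new_block.groupby(out["id"]).cumsum()
    return (
        out.groupby(["id", "g"])
        .agg(start_time=("time", "first"), end_time=("time", "last"))
        .droplevel("g")
        .reset_index()
    )


# Illustrative sample: the rows of id 2 from the question
sample = pd.DataFrame(
    {
        "id": [2, 2, 2],
        "time": [
            "2022-08-30 08:03:07",
            "2022-08-30 08:05:12",
            "2022-08-30 12:06:52",
        ],
    }
)
result = split_on_breaks(sample)
print(result)
```

Comparing against pd.Timedelta instead of converting to seconds is just a matter of taste; both keep the strict "larger than 10 minutes" rule from the question.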