Compute time differences between shifts with pandas

Let's say we have the following pandas DataFrame in Python:

worker	shift_start	shift_end	function
Alice	2022-01-15 11:30:00	2022-01-15 15:30:00	A
Alice	2022-01-15 17:30:00	2022-01-15 18:29:59	A
Alice	2022-01-15 18:30:00	2022-01-15 20:30:00	B
Alice	2022-01-16 10:30:00	2022-01-16 11:29:59	B
Alice	2022-01-16 12:30:00	2022-01-16 15:30:00	A
Bob	2022-01-15 10:30:00	2022-01-15 12:29:59	B
Bob	2022-01-15 12:30:00	2022-01-15 14:30:00	A
Bob	2022-01-15 15:30:00	2022-01-15 18:30:00	A
Bob	2022-01-17 10:30:00	2022-01-17 15:30:00	A
Bob	2022-01-17 16:30:00	2022-01-17 18:30:00	B

Here shift_start and shift_end are datetime columns. Each worker has their own schedule, with a function that may vary (irrelevant here). I would like to compute the break time between days, that is, the time difference between the end of the last shift of one working day and the start of the first shift on the next day they work:

worker	shift_start	shift_end	function	break_time
Alice	2022-01-15 11:30:00	2022-01-15 15:30:00	A	NaN
Alice	2022-01-15 17:30:00	2022-01-15 18:29:59	A	NaN
Alice	2022-01-15 18:30:00	2022-01-15 20:30:00	B	NaN
Alice	2022-01-16 10:30:00	2022-01-16 11:29:59	B	0 days 14 hours
Alice	2022-01-16 12:30:00	2022-01-16 15:30:00	A	0 days 14 hours
Bob	2022-01-15 10:30:00	2022-01-15 12:29:59	B	NaN
Bob	2022-01-15 12:30:00	2022-01-15 14:30:00	A	NaN
Bob	2022-01-15 15:30:00	2022-01-15 18:30:00	A	NaN
Bob	2022-01-17 10:30:00	2022-01-17 15:30:00	A	1 days 16 hours
Bob	2022-01-17 16:30:00	2022-01-17 18:30:00	B	1 days 16 hours

In both cases, January 15th has "NaN" because the previous shift is unknown. For Alice, 14 hours pass between the end of her shift at 20:30 and the start of her next one at 10:30 the following day. For Bob, however, there is a day off in between, so the difference is 1 day and 16 hours. The break_time value should be repeated for all entries on a given day. I am unable to find a way to group by worker and then compute the last and first shift for each day. Any help would be much appreciated.
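
To make the sample reproducible, it could be built roughly as follows. This is only a sketch: it assumes Alice's last shift starts on 2022-01-16 (as in the expected output), and the helper name rows is purely illustrative.

```python
import pandas as pd

# Sample rows from the question. Assumption: Alice's last shift starts on
# 2022-01-16, matching the expected output table.
rows = [
    ("Alice", "2022-01-15 11:30:00", "2022-01-15 15:30:00", "A"),
    ("Alice", "2022-01-15 17:30:00", "2022-01-15 18:29:59", "A"),
    ("Alice", "2022-01-15 18:30:00", "2022-01-15 20:30:00", "B"),
    ("Alice", "2022-01-16 10:30:00", "2022-01-16 11:29:59", "B"),
    ("Alice", "2022-01-16 12:30:00", "2022-01-16 15:30:00", "A"),
    ("Bob", "2022-01-15 10:30:00", "2022-01-15 12:29:59", "B"),
    ("Bob", "2022-01-15 12:30:00", "2022-01-15 14:30:00", "A"),
    ("Bob", "2022-01-15 15:30:00", "2022-01-15 18:30:00", "A"),
    ("Bob", "2022-01-17 10:30:00", "2022-01-17 15:30:00", "A"),
    ("Bob", "2022-01-17 16:30:00", "2022-01-17 18:30:00", "B"),
]
df = pd.DataFrame(rows, columns=["worker", "shift_start", "shift_end", "function"])

# Convert the string columns to datetimes so that .dt and subtraction work
df["shift_start"] = pd.to_datetime(df["shift_start"])
df["shift_end"] = pd.to_datetime(df["shift_end"])

print(df.dtypes)
```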

CodePudding user response:

I'm not completely sure about the structure of your data, but you could try the following (with df being your dataframe):

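# Assumes rows are sorted by shift_start within each worker.
# End of the previous shift, per worker (NaT for each worker's first shift)
shifted_end = df.groupby("worker")["shift_end"].shift()
# True where a shift starts on a different date than the previous one ended
m = df["shift_start"].dt.date != shifted_end.dt.date
# Keep the gap only at day breaks, then propagate it within each worker
df["break_time"] = (
    (df["shift_start"] - shifted_end)[m].reindex_like(df)
    .groupby(df["worker"]).ffill()
)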

First, build a series shifted_end that holds, for each row, the shift_end of the previous row. This has to be done grouped by worker to avoid mixing data from different workers.
Next, build a mask m that identifies the day breaks by comparing the .dt.date versions of shift_start and shifted_end.
Then subtract shifted_end from shift_start and keep only the results at the day breaks m. Use .reindex_like(df) to get the full index back, with missing values (NaT) at the restored positions. Finally, fill the missing values forward to propagate the day-break values to the remaining shifts of each day. This, again, has to be done grouped by worker; otherwise the first rows of each worker group would receive the last value of the previous worker.
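
The steps above can be traced one at a time on a small example. This is just an illustrative sketch: the four-row frame and the intermediate names gaps and break_time are made up for the demonstration and are not part of the original data.

```python
import pandas as pd

# Hypothetical mini-sample: one worker, two shifts on each of two days
df = pd.DataFrame({
    "worker": ["Alice"] * 4,
    "shift_start": pd.to_datetime([
        "2022-01-15 11:30", "2022-01-15 17:30",
        "2022-01-16 10:30", "2022-01-16 12:30",
    ]),
    "shift_end": pd.to_datetime([
        "2022-01-15 15:30", "2022-01-15 20:30",
        "2022-01-16 11:30", "2022-01-16 15:30",
    ]),
})

# Step 1: end of the previous shift, per worker (NaT on the first row)
shifted_end = df.groupby("worker")["shift_end"].shift()

# Step 2: True where the shift starts on another date than the previous one ended
m = df["shift_start"].dt.date != shifted_end.dt.date

# Step 3: gap to the previous shift, kept only at the day breaks
gaps = (df["shift_start"] - shifted_end)[m].reindex_like(df)

# Step 4: propagate each day-break value to the later shifts of the same day
break_time = gaps.groupby(df["worker"]).ffill()

# Rows 0-1 stay NaT; rows 2-3 both carry the 14-hour break
print(pd.DataFrame({"shifted_end": shifted_end, "day_break": m,
                    "gaps": gaps, "break_time": break_time}))
```

Printing the intermediates side by side makes it easy to see that only the first shift of a day gets a value in gaps, and that ffill copies it to the rest of that day.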

Result for your sample is:

  worker         shift_start           shift_end function      break_time
0  Alice 2022-01-15 11:30:00 2022-01-15 15:30:00        A             NaT
1  Alice 2022-01-15 17:30:00 2022-01-15 18:29:59        A             NaT
2  Alice 2022-01-15 18:30:00 2022-01-15 20:30:00        B             NaT
3  Alice 2022-01-16 10:30:00 2022-01-16 11:29:59        B 0 days 14:00:00
4  Alice 2022-01-16 12:30:00 2022-01-16 15:30:00        A 0 days 14:00:00
5    Bob 2022-01-15 10:30:00 2022-01-15 12:29:59        B             NaT
6    Bob 2022-01-15 12:30:00 2022-01-15 14:30:00        A             NaT
7    Bob 2022-01-15 15:30:00 2022-01-15 18:30:00        A             NaT
8    Bob 2022-01-17 10:30:00 2022-01-17 15:30:00        A 1 days 16:00:00
9    Bob 2022-01-17 16:30:00 2022-01-17 18:30:00        B 1 days 16:00:00
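
As an alternative that follows the wording of the question more literally (first and last shift per worker and day), something like the sketch below might also work. It is not the approach used above: it assumes that no shift crosses midnight, it uses a subset of the question's rows, and the names work, daily, first_start, last_end and out are made up for illustration.

```python
import pandas as pd

# Subset of the question's rows: Alice works two consecutive days,
# Bob has a day off in between
df = pd.DataFrame({
    "worker": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "shift_start": pd.to_datetime([
        "2022-01-15 11:30:00", "2022-01-15 18:30:00", "2022-01-16 10:30:00",
        "2022-01-15 15:30:00", "2022-01-17 10:30:00", "2022-01-17 16:30:00",
    ]),
    "shift_end": pd.to_datetime([
        "2022-01-15 15:30:00", "2022-01-15 20:30:00", "2022-01-16 11:29:59",
        "2022-01-15 18:30:00", "2022-01-17 15:30:00", "2022-01-17 18:30:00",
    ]),
})

# Calendar day of each shift (assumption: shifts never cross midnight)
work = df.assign(day=df["shift_start"].dt.normalize())

# One row per (worker, day): first start and last end of that day
daily = work.groupby(["worker", "day"], as_index=False).agg(
    first_start=("shift_start", "min"),
    last_end=("shift_end", "max"),
)

# Break = first start of the day minus last end of the previous working day
daily["break_time"] = (
    daily["first_start"] - daily.groupby("worker")["last_end"].shift()
)

# Broadcast the per-day value back onto every shift of that day
out = work.merge(
    daily[["worker", "day", "break_time"]], on=["worker", "day"], how="left"
).drop(columns="day")

print(out)
```

Because the per-day table is built by grouping rather than by shifting rows, this variant does not depend on the row order of df, at the cost of an extra merge.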