Let's say we have the following pandas DataFrame in Python:
worker | shift_start | shift_end | function |
---|---|---|---|
Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A |
Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A |
Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B |
Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B |
Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A |
Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B |
Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A |
Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A |
Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A |
Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B |
Here shift_start and shift_end are datetime columns. Each worker has their own schedule, with a function that may vary (irrelevant here). I would like to compute the break time between days, that is, the time difference between the end of the last shift on one working day and the start of the first shift on the next day they work:
worker | shift_start | shift_end | function | break_time |
---|---|---|---|---|
Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A | NaN |
Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A | NaN |
Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B | NaN |
Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B | 0 days 14 hours |
Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A | 0 days 14 hours |
Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B | NaN |
Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A | NaN |
Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A | NaN |
Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A | 1 days 16 hours |
Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B | 1 days 16 hours |
In both cases, January 15th has "NaN" because the previous shift is unknown. For Alice, 14 hours pass between the end of her shift at 20:30 and the start of the next one the day after at 10:30. For Bob, however, there is a gap day in between, so the difference is 1 day and 16 hours. The break_time value should be repeated for all entries on a given day. I am unable to find a way to group by worker and then compute the last and first shift for each day. Any help would be much appreciated.
CodePudding user response:
I'm not completely sure about the structure of your data, but you could try the following (with df as your dataframe):
shifted_end = df.groupby("worker")["shift_end"].shift()
m = df["shift_start"].dt.date != shifted_end.dt.date
df["break_time"] = (
(df["shift_start"] - shifted_end)[m].reindex_like(df)
.groupby(df["worker"]).ffill()
)
- First build a series `shifted_end` with shifted values from the column `shift_end`. This has to be done grouped by `worker` to avoid mixing data from different workers.
- Now build a mask `m` to identify the day breaks by comparing the `.dt.date` versions of `shift_start` and `shifted_end`.
- Then subtract the shifted `shift_end` column from the `shift_start` column and extract only the results at the day breaks `m`. Use `.reindex_like(df)` to get the full index back, with `NaN`s at the recovered indices. Then fill the `NaN`s forward to propagate the day break values. This, again, has to be done grouped by `worker`; otherwise the start of each worker group would get the last value from the worker before.
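The steps above can be tried end to end with a minimal sketch. The DataFrame construction is an assumption based on the sample in the question (reading the fifth row's start date as 2022-01-16), and the names `day_break` and `gap` are illustrative only. The sketch uses `Series.where` in place of boolean indexing followed by `.reindex_like(df)`, which should give the same result here. It also assumes the rows are already sorted by worker and shift start.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df = pd.DataFrame(
    {
        "worker": ["Alice"] * 5 + ["Bob"] * 5,
        "shift_start": pd.to_datetime(
            [
                "2022-01-15 11:30:00", "2022-01-15 17:30:00",
                "2022-01-15 18:30:00", "2022-01-16 10:30:00",
                "2022-01-16 12:30:00", "2022-01-15 10:30:00",
                "2022-01-15 12:30:00", "2022-01-15 15:30:00",
                "2022-01-17 10:30:00", "2022-01-17 16:30:00",
            ]
        ),
        "shift_end": pd.to_datetime(
            [
                "2022-01-15 15:30:00", "2022-01-15 18:29:59",
                "2022-01-15 20:30:00", "2022-01-16 11:29:59",
                "2022-01-16 15:30:00", "2022-01-15 12:29:59",
                "2022-01-15 14:30:00", "2022-01-15 18:30:00",
                "2022-01-17 15:30:00", "2022-01-17 18:30:00",
            ]
        ),
        "function": list("AABBA" + "BAAAB"),
    }
)

# Step 1: end of the previous shift, taken within each worker's rows.
shifted_end = df.groupby("worker")["shift_end"].shift()

# Step 2: mask marking rows whose shift starts on a different calendar
# day than the previous shift ended.
day_break = df["shift_start"].dt.date != shifted_end.dt.date

# Step 3: gap to the previous shift, kept only at day breaks, then
# forward-filled per worker so every shift of that day carries it.
gap = df["shift_start"] - shifted_end
df["break_time"] = gap.where(day_break).groupby(df["worker"]).ffill()

print(df)
```

If the rows are not sorted, calling `df.sort_values(["worker", "shift_start"])` first would be needed for the shift to pick up the correct previous row.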
Result for your sample is:
worker shift_start shift_end function break_time
0 Alice 2022-01-15 11:30:00 2022-01-15 15:30:00 A NaT
1 Alice 2022-01-15 17:30:00 2022-01-15 18:29:59 A NaT
2 Alice 2022-01-15 18:30:00 2022-01-15 20:30:00 B NaT
3 Alice 2022-01-16 10:30:00 2022-01-16 11:29:59 B 0 days 14:00:00
4 Alice 2022-01-16 12:30:00 2022-01-16 15:30:00 A 0 days 14:00:00
5 Bob 2022-01-15 10:30:00 2022-01-15 12:29:59 B NaT
6 Bob 2022-01-15 12:30:00 2022-01-15 14:30:00 A NaT
7 Bob 2022-01-15 15:30:00 2022-01-15 18:30:00 A NaT
8 Bob 2022-01-17 10:30:00 2022-01-17 15:30:00 A 1 days 16:00:00
9 Bob 2022-01-17 16:30:00 2022-01-17 18:30:00 B 1 days 16:00:00
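Since the question asks specifically about grouping by worker and taking the first and last shift of each day, here is an alternative sketch that does it that way. It is not the method used above, only a different route to what should be the same result. The names `day`, `daily`, `first_start`, `last_end` and `out` are hypothetical, and the sketch assumes no shift crosses midnight, so that a shift's day can be taken from `shift_start`.

```python
import pandas as pd

# Reduced sample in the same shape as the question's data (assumed layout).
df = pd.DataFrame(
    {
        "worker": ["Alice"] * 5 + ["Bob"] * 5,
        "shift_start": pd.to_datetime(
            [
                "2022-01-15 11:30:00", "2022-01-15 17:30:00",
                "2022-01-15 18:30:00", "2022-01-16 10:30:00",
                "2022-01-16 12:30:00", "2022-01-15 10:30:00",
                "2022-01-15 12:30:00", "2022-01-15 15:30:00",
                "2022-01-17 10:30:00", "2022-01-17 16:30:00",
            ]
        ),
        "shift_end": pd.to_datetime(
            [
                "2022-01-15 15:30:00", "2022-01-15 18:29:59",
                "2022-01-15 20:30:00", "2022-01-16 11:29:59",
                "2022-01-16 15:30:00", "2022-01-15 12:29:59",
                "2022-01-15 14:30:00", "2022-01-15 18:30:00",
                "2022-01-17 15:30:00", "2022-01-17 18:30:00",
            ]
        ),
        "function": list("AABBA" + "BAAAB"),
    }
)

# Calendar day of each shift (midnight timestamp), used as a grouping key.
day = df["shift_start"].dt.normalize().rename("day")

# One row per worker and day: earliest start and latest end of that day.
daily = (
    df.groupby(["worker", day])
    .agg(first_start=("shift_start", "min"), last_end=("shift_end", "max"))
    .reset_index()
)

# Break = first start of the day minus the last end of the previous
# working day of the same worker (NaT for each worker's first day).
daily["break_time"] = (
    daily["first_start"] - daily.groupby("worker")["last_end"].shift()
)

# Attach the per-day value back to every shift of that day.
out = (
    df.assign(day=day)
    .merge(daily[["worker", "day", "break_time"]], on=["worker", "day"], how="left")
    .drop(columns="day")
)

print(out)
```

This version does not depend on the row order of `df`, because the per-day table is sorted by the group keys before shifting.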