Finding datetimes since event with conditions in pandas

Time:12-30

I have a dataframe in pandas which reflects work shifts for employees (the time they are actually working). A snippet of it is the following:

import pandas as pd

# Note: the timestamps are stored as strings here; parse them with pd.to_datetime before doing arithmetic
df = pd.DataFrame({'Worker': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
                   'Shift_start': ['2022-01-01 10:00:00', '2022-01-01 13:10:00', '2022-01-01 15:45:00', '2022-01-01 11:30:00', '2022-01-01 13:40:00', '2022-01-01 15:20:00'],
                   'Shift_end': ['2022-01-01 12:30:00', '2022-01-01 15:30:00', '2022-01-01 17:30:00', '2022-01-01 13:30:00', '2022-01-01 15:10:00', '2022-01-01 18:10:00']})
Worker  Shift_start          Shift_end
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00

Now I need to compute, in every row, the time since the last partial break, defined as a pause of more than 20 minutes, measured up to the start time of each shift. That is, a pause of 15 minutes should be treated as if it had not happened, and the time would be computed since the last pause longer than 20 minutes. If no such pause exists, the time should be taken from the start of the worker's day. So I would need something like:

Worker  Shift_start          Shift_end            Hours_since_break
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00  0
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00  0
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00  2.58
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00  0
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00  2.17
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00  3.83

For Alice, the first row is 0: there is no previous break, so the time is measured from the start of her day, and since this is her first shift the result is 0 hours. In the second row, she has just taken a 40-minute pause, so again it is 0 hours since the break. In the third row, she has just taken a 15-minute pause, but as the minimum break is more than 20 minutes, it is as if she hadn't taken any break. Therefore, the time since her last break is counted from 13:10:00, when her last break finished, so the result is 2 hours and 35 minutes, i.e., 2.58 hours.

In the case of Bob, the same logic applies. The first row is 0 (it is the first shift of the day). In the second row, he has taken just a 10-minute break, which doesn't count, so the time since the last break is measured from the start of his day, i.e., 2h10m (2.17 hours). In the third row, he has taken a 10-minute break again, so the time is again measured from the start of his day, giving 3h50m (3.83 hours).
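As a quick sanity check on the arithmetic above, the three non-zero values can be recomputed directly from the timestamps in the example. This is only an illustrative sketch; the variable names are made up for the check.

```python
import pandas as pd

# Alice, third shift: time elapsed since 13:10, when her last counted break ended
alice_3 = pd.Timestamp("2022-01-01 15:45:00") - pd.Timestamp("2022-01-01 13:10:00")
# Bob, second and third shifts: no counted break, so measured from his 11:30 start
bob_2 = pd.Timestamp("2022-01-01 13:40:00") - pd.Timestamp("2022-01-01 11:30:00")
bob_3 = pd.Timestamp("2022-01-01 15:20:00") - pd.Timestamp("2022-01-01 11:30:00")

# Dividing a Timedelta by a one-hour Timedelta gives the duration in hours as a float
one_hour = pd.Timedelta(hours=1)
hours = [round(td / one_hour, 2) for td in (alice_3, bob_2, bob_3)]
print(hours)  # [2.58, 2.17, 3.83]
```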

To compute the breaks with the 20-minute constraint I did the following:

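# Gap between each shift start and the same worker's previous shift end
shifted_end = df.groupby("Worker")["Shift_end"].shift()
# pd.to_datetime is a no-op on datetime columns and parses them if they are still strings
df["Partial_break"] = pd.to_datetime(df["Shift_start"]) - pd.to_datetime(shifted_end)
df['Partial_break_hours'] = df["Partial_break"].dt.total_seconds() / 3600
# Pauses of 20 minutes or less do not count as breaks
df.loc[(df['Partial_break_hours'] <= 20 / 60), 'Partial_break_hours'] = 0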

But I can't think of a way to implement the search logic to give the desired output. Any help is much appreciated!

CodePudding user response:

You can try (assuming the DataFrame is sorted):

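def fn(x):
    # Parse to datetimes in case the columns still hold strings
    start = pd.to_datetime(x["Shift_start"])
    prev_end = pd.to_datetime(x["Shift_end"]).shift()
    # NaT for a worker's first shift compares as False, so that row resets the clock
    short_pause = (start - prev_end) <= pd.Timedelta(minutes=20)

    rv = []
    last_zero = None  # start of the shift that followed the last counted break
    for a, c in zip(start, short_pause):
        if c:
            rv.append(round((a - last_zero) / pd.Timedelta(hours=1), 2))
        else:
            last_zero = a
            rv.append(0)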

    return pd.Series(rv, index=x.index)


df["Hours_since_break"] = df.groupby("Worker").apply(fn).droplevel(0)
print(df)

Prints:

  Worker         Shift_start           Shift_end  Hours_since_break
0  Alice 2022-01-01 10:00:00 2022-01-01 12:30:00               0.00
1  Alice 2022-01-01 13:10:00 2022-01-01 15:30:00               0.00
2  Alice 2022-01-01 15:45:00 2022-01-01 17:30:00               2.58
3    Bob 2022-01-01 11:30:00 2022-01-01 13:30:00               0.00
4    Bob 2022-01-01 13:40:00 2022-01-01 15:10:00               2.17
5    Bob 2022-01-01 15:20:00 2022-01-01 18:10:00               3.83
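The answer above assumes the rows are already sorted. If that is not guaranteed, one option is to sort by worker and shift start first. The following is a minimal sketch; the `unsorted` frame is made up purely for illustration.

```python
import pandas as pd

# Deliberately shuffled rows, for illustration only
unsorted = pd.DataFrame({
    "Worker": ["Bob", "Alice", "Alice"],
    "Shift_start": pd.to_datetime(
        ["2022-01-01 11:30:00", "2022-01-01 13:10:00", "2022-01-01 10:00:00"]
    ),
    "Shift_end": pd.to_datetime(
        ["2022-01-01 13:30:00", "2022-01-01 15:30:00", "2022-01-01 12:30:00"]
    ),
})

# Order shifts chronologically within each worker and rebuild a clean 0..n-1 index
sorted_df = unsorted.sort_values(["Worker", "Shift_start"]).reset_index(drop=True)
print(sorted_df)
```

Sorting on parsed datetimes rather than raw strings avoids surprises if the timestamp format is ever not zero-padded.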

CodePudding user response:

You could calculate a "fullBreakAtStart" flag and, based on it, set a "lastShiftStart" column. Where "fullBreakAtStart" is False, leave the value missing and then forward-fill it with ffill(). Here is the code (assuming the DataFrame is sorted by worker and shift start):

import numpy as np
import pandas as pd

# Parse once; this also works if the columns are already datetimes
start = pd.to_datetime(df["Shift_start"])
df["Shift_end_prev"] = df.groupby("Worker")["Shift_end"].shift(1)

df["timeDiff"] = start - pd.to_datetime(df["Shift_end_prev"])
# A worker's first shift has no previous shift (NaT), so it also counts as a fresh start
df["fullBreakAtStart"] = (df["timeDiff"] > pd.Timedelta(minutes=20)) | (df["timeDiff"].isna())

# Keep the start time only where a full break precedes the shift, then carry it forward.
# Each worker's first row is always flagged, so the fill never leaks across workers.
df["lastShiftStart"] = start.where(df["fullBreakAtStart"]).ffill()

# Rows that follow a full break are their own reference point, so they come out as 0
df["Hours_since_break"] = (start - df["lastShiftStart"]) / np.timedelta64(1, 'h')
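For reference, the flag-and-forward-fill idea can be wrapped into one self-contained function and checked against the expected output from the question. This is a sketch under some assumptions, not the answerer's exact code: the function name `hours_since_break` and the `min_break` parameter are hypothetical, and the forward fill is done per worker so it does not rely on each worker's first row being flagged.

```python
import pandas as pd


def hours_since_break(df, min_break="20 minutes"):
    """Hours from the last counted break (or the first shift) to each shift start."""
    out = df.copy()
    out["Shift_start"] = pd.to_datetime(out["Shift_start"])
    out["Shift_end"] = pd.to_datetime(out["Shift_end"])
    out = out.sort_values(["Worker", "Shift_start"])

    # Pause before each shift; NaT for a worker's first shift
    gap = out["Shift_start"] - out.groupby("Worker")["Shift_end"].shift()
    # The clock resets on a first shift or after a pause longer than min_break
    resets = gap.isna() | (gap > pd.Timedelta(min_break))

    # Keep the start time on reset rows only, then carry it forward within each worker
    anchor = out["Shift_start"].where(resets).groupby(out["Worker"]).ffill()
    elapsed = out["Shift_start"] - anchor
    out["Hours_since_break"] = (elapsed / pd.Timedelta(hours=1)).round(2)
    return out


df = pd.DataFrame({
    "Worker": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "Shift_start": ["2022-01-01 10:00:00", "2022-01-01 13:10:00", "2022-01-01 15:45:00",
                    "2022-01-01 11:30:00", "2022-01-01 13:40:00", "2022-01-01 15:20:00"],
    "Shift_end": ["2022-01-01 12:30:00", "2022-01-01 15:30:00", "2022-01-01 17:30:00",
                  "2022-01-01 13:30:00", "2022-01-01 15:10:00", "2022-01-01 18:10:00"],
})

result = hours_since_break(df)
print(result[["Worker", "Shift_start", "Hours_since_break"]])
```

On the sample data this gives 0, 0, 2.58 for Alice and 0, 2.17, 3.83 for Bob, matching the table in the question.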