I have a dataframe with two columns: one with the status and the other with the datetime at which that status began:
>>> df
status date_start
0 NaN 2021-12-06 09:00:00
1 busy 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32
3 busy 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07
5 busy 2021-12-06 10:08:01
6 free 2021-12-06 10:17:00
7 NaN 2021-12-06 10:18:01
The dataset is already sorted by date_start, from oldest to newest.
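For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch that assumes date_start is a real datetime column (datetime64) rather than strings; the question does not say which dtype is used.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the datetime64 dtype is an assumption.
df = pd.DataFrame({
    "status": [np.nan, "busy", "free", "busy", "busy", "busy", "free", np.nan],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00",
        "2021-12-06 09:17:02",
        "2021-12-06 09:18:32",
        "2021-12-06 09:32:45",
        "2021-12-06 09:41:07",
        "2021-12-06 10:08:01",
        "2021-12-06 10:17:00",
        "2021-12-06 10:18:01",
    ]),
})
```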
I need to add another column, date_start_busy, that tells me, for each row, the datetime at which the current "busy" period started. The rules are:
- If status is "free" or NaN, then date_start_busy is NaN.
- If status is "busy" and the previous status is "free", then date_start_busy = date_start.
- If status is "busy" and the previous status is also "busy", then date_start_busy is the previous row's date_start_busy.
The final dataframe should look like this:
>>> df
status date_start date_start_busy
0 NaN 2021-12-06 09:00:00 NaN
1 busy 2021-12-06 09:17:02 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32 NaN
3 busy 2021-12-06 09:32:45 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07 2021-12-06 09:32:45
5 busy 2021-12-06 10:08:01 2021-12-06 09:32:45
6 free 2021-12-06 10:17:00 NaN
7 NaN 2021-12-06 10:18:01 NaN
I understand how to do this with a for loop, but my dataset is really large, so I would like a vectorized approach for better performance.
Thanks in advance!
CodePudding user response:
One option is with np.select:
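import numpy as np

# rule 1: status is missing or "free"
cond1 = df.status.isna() | df.status.eq('free')
# rule 2: "busy" right after "free"
cond2 = df.status.shift().eq('free') & df.status.eq('busy')
# rule 3: "busy" right after "busy"
cond3 = df.status.shift().eq('busy') & df.status.eq('busy')
# some extra steps to take care of the third condition
# which requires picking the very first value
temp1 = df.status.ne('busy').cumsum()  # id shared by each run of busy rows
temp2 = df.status.eq('busy')
temp3 = df.date_start.groupby([temp1, temp2], sort = False).transform('first')
# note: mixing np.nan with date_start assumes the column holds strings (object dtype);
# with a datetime64 column this step may raise a type-promotion error
temp3 = np.where(temp2, temp3, np.nan)
condlist = [cond1, cond2, cond3]
choicelist = [np.nan, df.date_start, temp3]
# the default covers a "busy" row whose previous status is NaN (row 1)
df.assign(date_start_busy = np.select(condlist,
                                      choicelist,
                                      default = df.date_start)
          )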
status date_start date_start_busy
0 NaN 2021-12-06 09:00:00 NaN
1 busy 2021-12-06 09:17:02 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32 NaN
3 busy 2021-12-06 09:32:45 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07 2021-12-06 09:32:45
5 busy 2021-12-06 10:08:01 2021-12-06 09:32:45
6 free 2021-12-06 10:17:00 NaN
7 NaN 2021-12-06 10:18:01 NaN
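The same grouping idea can also be written without np.select. The following is a minimal sketch, not the answer's own method: it assumes date_start is a datetime64 column, and the names is_busy, run_id, busy_start and result are made up for illustration. Non-busy rows come out as NaT (the datetime flavour of NaN) rather than a float NaN.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the datetime64 dtype is an assumption.
df = pd.DataFrame({
    "status": [np.nan, "busy", "free", "busy", "busy", "busy", "free", np.nan],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00", "2021-12-06 09:17:02",
        "2021-12-06 09:18:32", "2021-12-06 09:32:45",
        "2021-12-06 09:41:07", "2021-12-06 10:08:01",
        "2021-12-06 10:17:00", "2021-12-06 10:18:01",
    ]),
})

is_busy = df["status"].eq("busy")

# Every non-busy row bumps the counter, so each consecutive run of
# busy rows shares one id with the non-busy row just before it.
run_id = (~is_busy).cumsum()

# Blank out date_start on non-busy rows, then broadcast the first
# non-null timestamp of each run to every row of that run.
busy_start = df["date_start"].where(is_busy).groupby(run_id).transform("first")

# Mask again so that free / NaN rows stay empty (NaT).
result = df.assign(date_start_busy=busy_start.where(is_busy))
```

Because transform("first") skips null values, the non-busy row that opens each group does not affect the timestamp chosen for the run, and a busy row that follows a NaN status is treated as the start of a new busy period, as in row 1 of the expected output.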