Python Pandas: Fill column's element based on same column's previous row and other column

I have a dataframe with two columns: one with the status and the other with the datetime at which that status began:

>>> df
  status           date_start
0    NaN  2021-12-06 09:00:00
1   busy  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32
3   busy  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07
5   busy  2021-12-06 10:08:01
6   free  2021-12-06 10:17:00
7    NaN  2021-12-06 10:18:01

The dataset is already sorted by date_start, from oldest to newest.
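
To reproduce the example, the frame can be rebuilt roughly as follows. This is a minimal sketch: the question does not say whether date_start is stored as strings or as datetimes, so parsing it with pd.to_datetime is an assumption made here.

```python
import pandas as pd

# Sample data copied from the printed frame above.
# Assumption: date_start is parsed to datetime64; the original may hold strings.
df = pd.DataFrame({
    "status": [None, "busy", "free", "busy", "busy", "busy", "free", None],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00",
        "2021-12-06 09:17:02",
        "2021-12-06 09:18:32",
        "2021-12-06 09:32:45",
        "2021-12-06 09:41:07",
        "2021-12-06 10:08:01",
        "2021-12-06 10:17:00",
        "2021-12-06 10:18:01",
    ]),
})

# The question states the rows are already ordered oldest to newest.
print(df["date_start"].is_monotonic_increasing)
```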

I need to add another column that tells me, for each row, the datetime at which the current "busy" period started (date_start_busy). The rules are:

If status is "free" or "NaN", then date_start_busy is "NaN"
If status is "busy" and the previous status is "free", then date_start_busy = date_start
If status is "busy" and the previous status is also "busy", then date_start_busy should be the previous date_start_busy

The final dataframe should look like this:

>>> df
  status           date_start      date_start_busy
0    NaN  2021-12-06 09:00:00                  NaN
1   busy  2021-12-06 09:17:02  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32                  NaN
3   busy  2021-12-06 09:32:45  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07  2021-12-06 09:32:45
5   busy  2021-12-06 10:08:01  2021-12-06 09:32:45
6   free  2021-12-06 10:17:00                  NaN
7    NaN  2021-12-06 10:18:01                  NaN

I understand how I can accomplish this using a for loop; however, my dataset is really large and I would like to do it in a vectorized manner for better performance.
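
For reference, a row-by-row version of the rules could look like the sketch below, which is handy as a baseline for checking any vectorized result. The function name busy_start_loop is hypothetical, it works on plain lists rather than on the dataframe, and it assumes that a "busy" row following a NaN row starts a new busy period (the rules do not state this, but the expected output for row 1 implies it).

```python
def busy_start_loop(statuses, dates):
    """Return, for each row, the start of the current busy run, or None."""
    result = []
    prev_status, prev_start = None, None
    for status, date in zip(statuses, dates):
        if status != "busy":
            # "free" or missing status: no busy period is running
            start = None
        elif prev_status == "busy":
            # still inside the same busy run: carry the earlier start forward
            start = prev_start
        else:
            # first busy row after "free" (or, by assumption, after NaN)
            start = date
        result.append(start)
        prev_status, prev_start = status, start
    return result


statuses = [None, "busy", "free", "busy", "busy", "busy", "free", None]
dates = [
    "2021-12-06 09:00:00", "2021-12-06 09:17:02", "2021-12-06 09:18:32",
    "2021-12-06 09:32:45", "2021-12-06 09:41:07", "2021-12-06 10:08:01",
    "2021-12-06 10:17:00", "2021-12-06 10:18:01",
]
for status, start in zip(statuses, busy_start_loop(statuses, dates)):
    print(status, start)
```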

Thanks in advance!

CodePudding user response:

One option is with np.select:

import numpy as np

cond1 = df.status.isna() | df.status.eq('free')
cond2 = df.status.shift().eq('free') & df.status.eq('busy')
cond3 = df.status.shift().eq('busy') & df.status.eq('busy')

# some extra steps to take care of the third condition,
# which requires picking the very first date_start of each busy run
# temp1 increments on every non-busy row, so consecutive busy rows share an id
temp1 = df.status.ne('busy').cumsum()
temp2 = df.status.eq('busy')
temp3 = df.date_start.groupby([temp1, temp2], sort = False).transform('first')
temp3 = np.where(temp2, temp3, np.nan)
condlist = [cond1, cond2, cond3]
choicelist = [np.nan, df.date_start, temp3]
# rows matching no condition (a 'busy' row right after NaN) fall back to date_start
df.assign(date_start_busy = np.select(condlist,
                                      choicelist,
                                      default = df.date_start)
          )

  status           date_start      date_start_busy
0    NaN  2021-12-06 09:00:00                  NaN
1   busy  2021-12-06 09:17:02  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32                  NaN
3   busy  2021-12-06 09:32:45  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07  2021-12-06 09:32:45
5   busy  2021-12-06 10:08:01  2021-12-06 09:32:45
6   free  2021-12-06 10:17:00                  NaN
7    NaN  2021-12-06 10:18:01                  NaN
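
A shorter variant of the same run-grouping idea, offered as a sketch rather than a drop-in replacement: it skips np.select and relies only on Series.where and groupby(...).transform('first'). It assumes date_start has been parsed to datetime64, so missing values come out as NaT rather than the NaN printed above; the names is_busy, block and run_start are illustrative.

```python
import pandas as pd

# Sample frame from the question; parsing date_start to datetime64 is an assumption.
df = pd.DataFrame({
    "status": [None, "busy", "free", "busy", "busy", "busy", "free", None],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00", "2021-12-06 09:17:02", "2021-12-06 09:18:32",
        "2021-12-06 09:32:45", "2021-12-06 09:41:07", "2021-12-06 10:08:01",
        "2021-12-06 10:17:00", "2021-12-06 10:18:01",
    ]),
})

is_busy = df["status"].eq("busy")

# Every non-busy row bumps the counter, so each run of consecutive
# busy rows shares one block id with the non-busy row that precedes it.
block = (~is_busy).cumsum()

# Blank out non-busy dates first; 'first' then picks the first non-null
# value per block, which is the start of that busy run.
run_start = df["date_start"].where(is_busy).groupby(block).transform("first")

# Keep the run start only on busy rows; everything else becomes NaT.
df["date_start_busy"] = run_start.where(is_busy)
print(df)
```

Because a busy row that follows NaN simply opens a new block, no separate default branch is needed for row 1.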