Create a new column with the most recent row in groupby when a column value condition is met

I want to create a new column that records, for each person (i.e. pid), the date of the most recent previous row where obs_occurred==1. For example, the shift column below is what I want:

obs_occurred    pid         date           shift
1.0             10000001    2016-01-06     2016-01-06
1.0             10000001    2016-01-07     2016-01-06
0.0             10000001    2016-01-08     2016-01-07
1.0             10000001    2016-01-09     2016-01-07
1.0             10000001    2016-01-10     2016-01-09
...             ...         ...

I can use groupby()[].shift() to create a new column that has the date from the previous row for each pid, but how can I add the condition that obs_occurred==1?

CodePudding user response：

Conditional shift:

At first, I even tried to write a recursive function that keeps shifting until the criterion is met. Eventually it struck me that this is a good case for ffill, because each row needs the nearest earlier value.

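import numpy as np

# Keep the previous row's date only where the previous row has obs_occurred == 1
df['modify_date'] = np.where(df['obs_occurred'].shift() == 1, df['date'].shift(1), np.datetime64('NaT'))
# Fill only the new column; a DataFrame-wide fillna(method=...) also touches other columns and is deprecated in recent pandas versions
df['modify_date'] = df['modify_date'].ffill().bfill()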
df
###
   obs_occurred       pid       date modify_date
0           1.0  10000001 2016-01-06  2016-01-06
1           1.0  10000001 2016-01-07  2016-01-06
2           0.0  10000001 2016-01-08  2016-01-07
3           1.0  10000001 2016-01-09  2016-01-07
4           1.0  10000001 2016-01-10  2016-01-09
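
The same idea can also be written without np.where. The following is a minimal sketch, assuming date is already a datetime column; the sample frame is rebuilt from the table in the question so the snippet runs on its own:

```python
import pandas as pd

# Sample data copied from the question (single pid).
df = pd.DataFrame({
    "obs_occurred": [1.0, 1.0, 0.0, 1.0, 1.0],
    "pid": [10000001] * 5,
    "date": pd.to_datetime(
        ["2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10"]
    ),
})

# Keep the date only on rows where obs_occurred == 1 (NaT elsewhere).
marked = df["date"].where(df["obs_occurred"] == 1)

# shift() moves each marked date down to the next row,
# ffill() carries the last marked date across the gaps,
# bfill() gives the first row a value, since it has no predecessor.
df["modify_date"] = marked.shift().ffill().bfill()
```

Masking first and shifting second means only one shift is needed, and the fill is limited to the new column.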

Conditional shift within group:

df2
###
   obs_occurred       pid       date
0             1  10000001 2016-01-06
1             1  10000001 2016-01-07
2             0  10000001 2016-01-08
3             1  10000001 2016-01-09
4             1  10000001 2016-01-10
5             1  10000002 2016-01-06
6             1  10000002 2016-01-07
7             0  10000002 2016-01-08
8             1  10000002 2016-01-09
9             1  10000002 2016-01-10

df2['md_idg'] = np.where(df2.groupby('pid')['obs_occurred'].shift() == 1, df2.groupby('pid')['date'].shift(1), np.datetime64('NaT'))
df2['md_idg'] = df2.groupby('pid')['md_idg'].transform(lambda x: x.ffill().bfill())

df2
###
   obs_occurred       pid       date     md_idg
0             1  10000001 2016-01-06 2016-01-06
1             1  10000001 2016-01-07 2016-01-06
2             0  10000001 2016-01-08 2016-01-07
3             1  10000001 2016-01-09 2016-01-07
4             1  10000001 2016-01-10 2016-01-09
5             1  10000002 2016-01-06 2016-01-06
6             1  10000002 2016-01-07 2016-01-06
7             0  10000002 2016-01-08 2016-01-07
8             1  10000002 2016-01-09 2016-01-07
9             1  10000002 2016-01-10 2016-01-09
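
For reuse, the grouped version might be wrapped in a small helper. This is only a sketch: the function name previous_obs_date and its parameters are made up for illustration, and it assumes the rows are already sorted by date within each pid.

```python
import pandas as pd

def previous_obs_date(frame, group_col="pid", flag_col="obs_occurred", date_col="date"):
    """Per group, return the date of the most recent earlier row whose flag is 1."""
    # Keep the date only where the flag is 1 (NaT elsewhere).
    marked = frame[date_col].where(frame[flag_col] == 1)
    # Shift and fill inside each group so dates never leak from one pid to the next.
    return marked.groupby(frame[group_col]).transform(
        lambda s: s.shift().ffill().bfill()
    )

days = ["2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10"]
df2 = pd.DataFrame({
    "obs_occurred": [1, 1, 0, 1, 1] * 2,
    "pid": [10000001] * 5 + [10000002] * 5,
    "date": pd.to_datetime(days * 2),
})
df2["md_idg"] = previous_obs_date(df2)
```

The trailing bfill() fills the first row of each pid from the row below it, which matches the table in the question. Dropping it leaves NaT on rows that have no earlier observation.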

CodePudding user response：

First, I created a dictionary mapping each pid to the latest date on which obs_occurred is 1:

max_dates = df[df.obs_occurred == 1].groupby('pid').date.max().to_dict()

Then I added a column by mapping each pid through this dictionary. Every row of a pid gets the same date, and a pid with no obs_occurred == 1 row gets a missing value instead of raising a KeyError:

df['last_date'] = df.pid.map(max_dates)
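
The per-pid latest date can also be broadcast without an intermediate dictionary. A minimal sketch on made-up sample rows (the third pid is hypothetical and has no observation at all), assuming date is a datetime column:

```python
import pandas as pd

df = pd.DataFrame({
    "obs_occurred": [1, 1, 0, 1, 0, 0],
    "pid": [10000001, 10000001, 10000001, 10000002, 10000002, 10000003],
    "date": pd.to_datetime([
        "2016-01-06", "2016-01-07", "2016-01-08",
        "2016-01-06", "2016-01-07", "2016-01-06",
    ]),
})

# Keep the date only where obs_occurred == 1, then take the per-pid maximum
# and broadcast it back onto every row of that pid.
marked = df["date"].where(df["obs_occurred"] == 1)
df["last_date"] = marked.groupby(df["pid"]).transform("max")
```

This yields one date per pid, repeated on all of its rows, so it answers "latest observation overall" rather than "previous observation relative to each row".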