I want to create a new column that records the most recent/previous date value when obs_occurred == 1 for each person (i.e. pid). For example, the shift column below is what I want:
obs_occurred pid date shift
1.0 10000001 2016-01-06 2016-01-06
1.0 10000001 2016-01-07 2016-01-06
0.0 10000001 2016-01-08 2016-01-07
1.0 10000001 2016-01-09 2016-01-07
1.0 10000001 2016-01-10 2016-01-09
... ... ...
I can use groupby()[].shift() to create a new column that has the date from the previous row for each pid, but how can I add the condition that obs_occurred == 1?
CodePudding user response:
Conditional shift:
At first, I even tried to write a recursive function that keeps shifting until the criterion is met. Eventually it struck me that this is a good case for ffill, because each row needs the nearest preceding value that qualifies.
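import numpy as np

# Keep the previous row's date only where the previous row has obs_occurred == 1
df['modify_date'] = np.where(df['obs_occurred'].shift() == 1, df['date'].shift(1), np.datetime64('NaT'))
# Fill only the new column: forward-fill, then back-fill the leading NaT
df['modify_date'] = df['modify_date'].ffill().bfill()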
df
###
obs_occurred pid date modify_date
0 1.0 10000001 2016-01-06 2016-01-06
1 1.0 10000001 2016-01-07 2016-01-06
2 0.0 10000001 2016-01-08 2016-01-07
3 1.0 10000001 2016-01-09 2016-01-07
4 1.0 10000001 2016-01-10 2016-01-09
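For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch of the same idea. It rebuilds the five sample rows from the question and uses Series.where instead of np.where, which is a stylistic swap rather than part of the answer above. The variable names prev_date and prev_hit are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "obs_occurred": [1.0, 1.0, 0.0, 1.0, 1.0],
    "pid": [10000001] * 5,
    "date": pd.to_datetime(
        ["2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10"]
    ),
})

# Date of the previous row, and whether that previous row had obs_occurred == 1
prev_date = df["date"].shift()
prev_hit = df["obs_occurred"].shift() == 1

# Keep the previous date only where the previous row qualifies (NaT elsewhere)
df["modify_date"] = prev_date.where(prev_hit)

# Carry the last qualifying date forward, then back-fill the leading NaT
df["modify_date"] = df["modify_date"].ffill().bfill()

print(df)
```

The back-fill only affects rows before the first qualifying date; if those rows should stay empty instead, dropping .bfill() leaves them as NaT.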
Conditional shift within group:
df2
###
obs_occurred pid date
0 1 10000001 2016-01-06
1 1 10000001 2016-01-07
2 0 10000001 2016-01-08
3 1 10000001 2016-01-09
4 1 10000001 2016-01-10
5 1 10000002 2016-01-06
6 1 10000002 2016-01-07
7 0 10000002 2016-01-08
8 1 10000002 2016-01-09
9 1 10000002 2016-01-10
df2['md_idg'] = np.where(df2.groupby('pid')['obs_occurred'].shift() == 1, df2.groupby('pid')['date'].shift(1), np.datetime64('NaT'))
df2['md_idg'] = df2.groupby('pid')['md_idg'].transform(lambda x: x.ffill().bfill())
df2
###
obs_occurred pid date md_idg
0 1 10000001 2016-01-06 2016-01-06
1 1 10000001 2016-01-07 2016-01-06
2 0 10000001 2016-01-08 2016-01-07
3 1 10000001 2016-01-09 2016-01-07
4 1 10000001 2016-01-10 2016-01-09
5 1 10000002 2016-01-06 2016-01-06
6 1 10000002 2016-01-07 2016-01-06
7 0 10000002 2016-01-08 2016-01-07
8 1 10000002 2016-01-09 2016-01-07
9 1 10000002 2016-01-10 2016-01-09
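The grouped version can be checked the same way. The sketch below rebuilds df2 from the rows shown above and is a rough equivalent of the two lines in this answer, written with Series.where. The names g and candidate are illustrative. The point to look at is row 5: because the shift and the fill both happen per pid, the first row of 10000002 does not inherit a date from 10000001.

```python
import pandas as pd

dates = ["2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10"]

# Two persons with the same five rows each, as in the df2 shown above
df2 = pd.DataFrame({
    "obs_occurred": [1, 1, 0, 1, 1] * 2,
    "pid": [10000001] * 5 + [10000002] * 5,
    "date": pd.to_datetime(dates * 2),
})

g = df2.groupby("pid")

# Previous date within each pid, kept only where the previous row had obs_occurred == 1
candidate = g["date"].shift().where(g["obs_occurred"].shift() == 1)

# Fill within each pid so that dates never leak from one person to the next
df2["md_idg"] = candidate.groupby(df2["pid"]).transform(lambda s: s.ffill().bfill())

print(df2)
```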
CodePudding user response:
First, I created a dictionary mapping each pid to the last date on which obs_occurred is 1:
max_dates = df[df.obs_occurred == 1].groupby('pid').date.max().to_dict()
Then add a column by mapping each pid through this dictionary:
df['last_date'] = df.pid.map(max_dates)
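To make the output of this approach concrete, here is a small sketch on made-up rows (the data is illustrative, not taken from the question). It compares against the number 1, since the sample data shows obs_occurred as numeric, and uses map so that a pid with no qualifying row gets NaT rather than raising a KeyError. Note that this assigns one date per pid, the latest date with obs_occurred == 1, to every row of that pid; it is not a per-row previous date.

```python
import pandas as pd

# Illustrative data: two persons, not the rows from the question
df = pd.DataFrame({
    "obs_occurred": [1.0, 1.0, 0.0, 1.0, 0.0],
    "pid": [10000001, 10000001, 10000001, 10000002, 10000002],
    "date": pd.to_datetime(
        ["2016-01-06", "2016-01-07", "2016-01-08", "2016-01-06", "2016-01-07"]
    ),
})

# Latest date per pid among rows where obs_occurred == 1
max_dates = df.loc[df["obs_occurred"] == 1].groupby("pid")["date"].max()

# Broadcast that single date to every row of the same pid
df["last_date"] = df["pid"].map(max_dates)

print(df)
```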