If I have a df with just days of activity I can add a column pev_day
by
# make sure vals are sorted
df = df.sort_values(['id', 'date'])
# add col with prev date
df['prev_date'] = df.groupby('id')['date'].shift()
Now I have a different dataset where I have not just the "active" days but some days of inactivity in between. How could I get around this issue to achieve this:
id date active prev_date
A 01-01-2021 1 NaN
A 02-01-2021 1 01-01-2021
A 03-01-2021 0 02-01-2021
A 04-01-2021 0 02-01-2021
A 05-01-2021 1 02-01-2021
A 06-01-2021 1 05-01-2021
[..]
B 01-01-2021 1 NaN
B 02-01-2021 0 01-01-2021
[..]
CodePudding user response:
IIUC, one way is to use where
to mask the dates without activity and ffill
to propagate the date with the last activity. Then like you did with groupby.shift
.
# dummy data
df = pd.DataFrame({
'id':['A']*6 ['B']*6,
'date': pd.date_range('2021-07-01', freq='d', periods=12),
'activity': [1,1,0,0,0,1,1,0,0,1,0,1]
})
# create the prev_date column
df['prev_date'] = (
df['date'].where(df['activity'].astype(bool)) #mask dates without activity
.ffill() # propagate last date with activity
.groupby(df['id']).shift() # shift the dates per group of id
)
print(df)
id date activity prev_date
0 A 2021-07-01 1 NaT
1 A 2021-07-02 1 2021-07-01
2 A 2021-07-03 0 2021-07-02
3 A 2021-07-04 0 2021-07-02
4 A 2021-07-05 0 2021-07-02
5 A 2021-07-06 1 2021-07-02
6 B 2021-07-07 1 NaT
7 B 2021-07-08 0 2021-07-07
8 B 2021-07-09 0 2021-07-07
9 B 2021-07-10 1 2021-07-07
10 B 2021-07-11 0 2021-07-10
11 B 2021-07-12 1 2021-07-10