Get last day where user was active in pd DataFrame-CodePudding

If I have a df with just days of activity I can add a column pev_day by

# make sure vals are sorted
df = df.sort_values(['id', 'date'])
# add col with prev date
df['prev_date'] = df.groupby('id')['date'].shift()

Now I have a different dataset where I have not just the "active" days but some days of inactivity in between. How could I get around this issue to achieve this:

id date       active prev_date
A  01-01-2021 1      NaN
A  02-01-2021 1      01-01-2021
A  03-01-2021 0      02-01-2021
A  04-01-2021 0      02-01-2021
A  05-01-2021 1      02-01-2021
A  06-01-2021 1      05-01-2021
[..]
B  01-01-2021 1      NaN
B  02-01-2021 0      01-01-2021
[..]

CodePudding user response：

IIUC, one way is to use where to mask the dates without activity and ffill to propagate the date with the last activity. Then like you did with groupby.shift.

# dummy data
df = pd.DataFrame({
    'id':['A']*6   ['B']*6,
    'date': pd.date_range('2021-07-01', freq='d', periods=12),
    'activity': [1,1,0,0,0,1,1,0,0,1,0,1]
})

# create the prev_date column
df['prev_date'] = (
    df['date'].where(df['activity'].astype(bool)) #mask dates without activity
      .ffill() # propagate last date with activity
      .groupby(df['id']).shift() # shift the dates per group of id
)
print(df)
   id       date  activity  prev_date
0   A 2021-07-01         1        NaT
1   A 2021-07-02         1 2021-07-01
2   A 2021-07-03         0 2021-07-02
3   A 2021-07-04         0 2021-07-02
4   A 2021-07-05         0 2021-07-02
5   A 2021-07-06         1 2021-07-02
6   B 2021-07-07         1        NaT
7   B 2021-07-08         0 2021-07-07
8   B 2021-07-09         0 2021-07-07
9   B 2021-07-10         1 2021-07-07
10  B 2021-07-11         0 2021-07-10
11  B 2021-07-12         1 2021-07-10