Home > database >  Get last day where user was active in pd DataFrame
Get last day where user was active in pd DataFrame

Time:09-21

If I have a df with just days of activity I can add a column pev_day by

# make sure vals are sorted
df = df.sort_values(['id', 'date'])
# add col with prev date
df['prev_date'] = df.groupby('id')['date'].shift()

Now I have a different dataset where I have not just the "active" days but some days of inactivity in between. How could I get around this issue to achieve this:

id date       active prev_date
A  01-01-2021 1      NaN
A  02-01-2021 1      01-01-2021
A  03-01-2021 0      02-01-2021
A  04-01-2021 0      02-01-2021
A  05-01-2021 1      02-01-2021
A  06-01-2021 1      05-01-2021
[..]
B  01-01-2021 1      NaN
B  02-01-2021 0      01-01-2021
[..]

CodePudding user response:

IIUC, one way is to use where to mask the dates without activity and ffill to propagate the date with the last activity. Then like you did with groupby.shift.

# dummy data
df = pd.DataFrame({
    'id':['A']*6   ['B']*6,
    'date': pd.date_range('2021-07-01', freq='d', periods=12),
    'activity': [1,1,0,0,0,1,1,0,0,1,0,1]
})

# create the prev_date column
df['prev_date'] = (
    df['date'].where(df['activity'].astype(bool)) #mask dates without activity
      .ffill() # propagate last date with activity
      .groupby(df['id']).shift() # shift the dates per group of id
)
print(df)
   id       date  activity  prev_date
0   A 2021-07-01         1        NaT
1   A 2021-07-02         1 2021-07-01
2   A 2021-07-03         0 2021-07-02
3   A 2021-07-04         0 2021-07-02
4   A 2021-07-05         0 2021-07-02
5   A 2021-07-06         1 2021-07-02
6   B 2021-07-07         1        NaT
7   B 2021-07-08         0 2021-07-07
8   B 2021-07-09         0 2021-07-07
9   B 2021-07-10         1 2021-07-07
10  B 2021-07-11         0 2021-07-10
11  B 2021-07-12         1 2021-07-10
  • Related