How to fill last non-null value for each user in pandas?

Time:08-15

I have a DataFrame of user journeys that shows the purchase amounts of products. Users do not buy every day, so I want to fill each missing value with that user's last non-null value. Currently, I have:

date       | user_id | purchase_value
2020-01-01 | 1       | null
2020-01-02 | 1       | 1
2020-01-03 | 1       | null
2020-01-04 | 1       | 4
2020-01-01 | 2       | 55
2020-01-02 | 2       | null

I want it to look like this:

date       | user_id | purchase_value
2020-01-01 | 1       | null
2020-01-02 | 1       | 1
2020-01-03 | 1       | 1
2020-01-04 | 1       | 4
2020-01-01 | 2       | 55
2020-01-02 | 2       | 55

Explanation: For user 1, we fill in 1 on 2020-01-03, since the last non-null value (from 2020-01-02) was 1. For user 2, we fill in 55 on 2020-01-02, since the last non-null value (from 2020-01-01) was 55.

How would I do this in pandas for each user_id and date? Also, the dates do not have to be sequential, i.e. there can be gaps in the dates. In that case, always fill in the last non-null value (whenever that was).

CodePudding user response:

If you really want to forward-fill only the last NaN in each group, you first need to identify it, then replace it with the per-group ffill value:

# is the value NaN?
m1 = df['purchase_value'].isna()

# is this the last NaN of the group?
# here: is this the first NaN of the group in reverse?
# (the reversed cumulative count equals 1 from the last NaN back to the
# previous one, so m2 must be combined with m1 below)
m2 = m1[::-1].groupby(df['user_id']).cumsum().eq(1)

# then replace with the ffill per group
df.loc[m1 & m2, 'purchase_value'] = df.groupby('user_id')['purchase_value'].ffill()

Output:

         date  user_id  purchase_value
0  2020-01-01        1             NaN
1  2020-01-02        1             1.0
2  2020-01-03        1             1.0
3  2020-01-04        1             4.0
4  2020-01-01        2            55.0
5  2020-01-02        2            55.0
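
If the goal is simply to carry each user's most recent non-null value forward into every later gap (which is how the question reads), a plain per-group forward fill may be enough. The following is a minimal sketch, not a definitive implementation. It rebuilds the question's sample data, and it assumes that sorting by user_id and date is acceptable; the sort step is an addition that is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN stands in for null.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-01", "2020-01-02",
    ]),
    "user_id": [1, 1, 1, 1, 2, 2],
    "purchase_value": [np.nan, 1, np.nan, 4, 55, np.nan],
})

# Assumption: sort so that "last" means the most recent earlier date per user.
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# Forward-fill within each user only; values never carry over between users.
df["purchase_value"] = df.groupby("user_id")["purchase_value"].ffill()

print(df)
```

The fill follows row order rather than the calendar, so gaps between dates make no difference once the rows are sorted. User 1's first row stays NaN because there is no earlier value to carry forward. On this sample data the result matches the output above; the two approaches differ only when a user has more than one fillable NaN, since the masked version fills just the last one in each group.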