I have a pandas DataFrame of the form
quarter | user_id | # Sessions |
---|---|---|
2022 Q1 | 1 | 9 |
2021 Q4 | 1 | |
2021 Q3 | 1 | |
2022 Q1 | 2 | 8 |
2021 Q4 | 2 | |
2021 Q3 | 2 | |
And I'd like to forward fill the # Sessions column within each user_id to get a table like:
quarter | user_id | # Sessions |
---|---|---|
2022 Q1 | 1 | 9 |
2021 Q4 | 1 | 9 |
2021 Q3 | 1 | 9 |
2022 Q1 | 2 | 8 |
2021 Q4 | 2 | 8 |
2021 Q3 | 2 | 8 |
I can do this with
x.groupby('user_id').apply(lambda x: x.fillna({'# Sessions': x['# Sessions'].ffill()}))
But I have reasonably big data (~10k users), and this is a very common operation in the codebase.
Is .groupby necessary, or is there a more performant way to achieve the same?
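For reference, a minimal reconstruction of the sample frame (taking the blank cells as NaN, and using the same name x as in the snippet above):

import numpy as np
import pandas as pd

# Sample data from the question; blank "# Sessions" cells taken as NaN
x = pd.DataFrame({
    'quarter': ['2022 Q1', '2021 Q4', '2021 Q3', '2022 Q1', '2021 Q4', '2021 Q3'],
    'user_id': [1, 1, 1, 2, 2, 2],
    '# Sessions': [9, np.nan, np.nan, 8, np.nan, np.nan],
})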
CodePudding user response:
You could also try transforming with first (since first skips NaNs):
df['# Sessions'] = df.groupby('user_id')['# Sessions'].transform('first')
Output:
quarter user_id # Sessions
0 2022 Q1 1 9.0
1 2021 Q4 1 9.0
2 2021 Q3 1 9.0
3 2022 Q1 2 8.0
4 2021 Q4 2 8.0
5 2021 Q3 2 8.0
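Note that transform('first') broadcasts the first non-NaN value of each group to every row, which matches a forward fill here only because each user has a single value in the top row. A minimal sketch of where the two differ, on hypothetical data (not from the question):

import numpy as np
import pandas as pd

# Hypothetical group with two distinct values
s = pd.DataFrame({'user_id': [1, 1, 1], '# Sessions': [9, np.nan, 7]})

s.groupby('user_id')['# Sessions'].transform('first')  # -> 9, 9, 9
s.groupby('user_id')['# Sessions'].ffill()             # -> 9, 9, 7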
CodePudding user response:
We can avoid groupby entirely. First we sort with DataFrame.sort_values, then we ffill, keeping the original value on the rows where the user_id changes so the fill never crosses from one user into the first row of the next.
# Sort so that rows of the same user are contiguous
df2 = df.sort_values('user_id')

# Forward-fill, but keep the original value where user_id differs from the
# previous row (the first row of each user); assignment aligns back on index
df['# Sessions'] = df2['# Sessions'].ffill()\
    .where(df2['user_id'].eq(df2['user_id'].shift()),
           df2['# Sessions'])
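One thing to watch (my reading of the approach, not stated above): only the row where user_id changes is protected, so if a user's first row is itself NaN, later rows of that user can still inherit the previous user's value. A sketch with hypothetical data:

import numpy as np
import pandas as pd

# Hypothetical case (not from the question): user 2 has no sessions at all
df = pd.DataFrame({'user_id': [1, 1, 2, 2],
                   '# Sessions': [9, np.nan, np.nan, np.nan]})

df2 = df.sort_values('user_id')
filled = df2['# Sessions'].ffill()\
    .where(df2['user_id'].eq(df2['user_id'].shift()), df2['# Sessions'])
# filled is [9, 9, NaN, 9]: user 2's last row picks up user 1's value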
CodePudding user response:
This should be much faster than what you're doing:
x['# Sessions'] = x.groupby('user_id')['# Sessions'].ffill()
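For a rough comparison at the question's scale (~10k users), one way is to build synthetic data like the sketch below and time each approach (e.g. with %timeit in a notebook); the shape is an assumption, not the question's real data:

import numpy as np
import pandas as pd

# Synthetic data roughly at the question's scale: 10k users, 3 quarters each,
# with only the most recent quarter populated (an assumption for the benchmark)
n_users = 10_000
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), 3),
    '# Sessions': np.where(np.tile([True, False, False], n_users),
                           np.random.randint(1, 20, 3 * n_users),
                           np.nan),
})

# Vectorized groupby ffill: no Python-level lambda runs per group
df['# Sessions'] = df.groupby('user_id')['# Sessions'].ffill()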