I have a datframe that looks like this
user_id product_id created_at
1 100 2019-04-21 20:20:00
1 100 2019-04-23 00:10:00
1 200 2019-05-24 10:00:00
1 200 2020-06-24 10:10:24
2 100 2019-01-22 21:10:00
2 200 2019-04-25 20:23:30
2 300 2021-01-21 10:20:00
3 400 2019-12-21 10:20:00
3 400 2021-04-21 10:20:00
I'm trying to find the most bought product_id
for each user_id
. I know I can use groupby along with value_counts
or pd.Series.mode
to get this. something like this:
df.groupby(['user_id'])['product_id',].apply(lambda x: x.value_counts().index[0]).reset_index()
This works fine until I reach a user_id
having two or more product_id
with the same count( like in the sample dataframe). When this happens, ideally I'd want to select the product_id
with the most recent created_at
. So for eg. for user_id
1, I'd want to select product_id
200 as it was last bought more recently.
What would be the most optimal way to achieve this?
CodePudding user response:
You can calculate both count
and max
on dates, then sort on these values and drop duplicates (or use groupby().head()):
s = df.groupby(['user_id','product_id'])['created_at'].agg(['count','max'])
s.sort_values(['count','max'], ascending=False).groupby('user_id').head(1)
Output:
count max
user_id product_id
3 400 2 2021-04-21 10:20:00
1 200 2 2020-06-24 10:10:24
2 300 1 2021-01-21 10:20:00
CodePudding user response:
df.\
sort_values("created_at", ascending=False).\
groupby(["user_id", "product_id"], sort=False, as_index=False).\
count().\
groupby("user_id", as_index=False).\
head(1)
# user_id product_id created_at
# 0 1 200 2
# 1 2 300 1
# 2 3 400 2