I need to group by user_id alone and display content_id which is of the latest date. When I tried like below, one of the user is getting correct values but another user (123) is getting wrong value. Expected is content_id - 20 but 21 is coming. Please help how to resolve this.
Sample data:
data = {'user_id':['123', '123', '234', '234'],
'content_id':[20, 21, 19, 18],
'date':['2020-10-14', '2019-01-25','2021-05-26','2018-03-26']}
df = pd.DataFrame(data)
df
user_id content_id date
0 123 20 2020-10-14
1 123 21 2019-01-25
2 234 19 2021-05-26
3 234 18 2018-03-26
df.sort_values(['user_id','content_id','date']).groupby(['user_id'])['content_id','date'].last()
user_id content_id date
123 21 2019-01-25
234 19 2021-05-26
CodePudding user response:
If date
has higher precendence than content_id
, use that fact in sort_values
:
out = df.sort_values(['user_id','date','content_id']).groupby(['user_id'])[['content_id','date']].last()
Another possibility is to convert date
to datetime and the find the latest date's index using groupby
idxmax
; then use loc
to filter the desired output:
df['date'] = pd.to_datetime(df['date'])
out = df.loc[df.groupby('user_id')['date'].idxmax()]
Output:
content_id date
user_id
123 20 2020-10-14
234 19 2021-05-26