Home > OS >  Pandas groupby a column and sort by date and get only the latest row
Pandas groupby a column and sort by date and get only the latest row

Time:03-27

I need to group by user_id alone and display content_id which is of the latest date. When I tried like below, one of the user is getting correct values but another user (123) is getting wrong value. Expected is content_id - 20 but 21 is coming. Please help how to resolve this.

Sample data:

data = {'user_id':['123', '123', '234', '234'],
        'content_id':[20, 21, 19, 18],
        'date':['2020-10-14', '2019-01-25','2021-05-26','2018-03-26']}
 

df = pd.DataFrame(data)
df


    user_id content_id  date
0   123     20          2020-10-14
1   123     21          2019-01-25
2   234     19          2021-05-26
3   234     18          2018-03-26


df.sort_values(['user_id','content_id','date']).groupby(['user_id'])['content_id','date'].last()


user_id content_id  date
        
123     21         2019-01-25
234     19         2021-05-26

CodePudding user response:

If date has higher precendence than content_id, use that fact in sort_values:

out = df.sort_values(['user_id','date','content_id']).groupby(['user_id'])[['content_id','date']].last()

Another possibility is to convert date to datetime and the find the latest date's index using groupby idxmax; then use loc to filter the desired output:

df['date'] = pd.to_datetime(df['date'])
out = df.loc[df.groupby('user_id')['date'].idxmax()]

Output:

         content_id        date
user_id                        
123              20  2020-10-14
234              19  2021-05-26
  • Related