Home > Blockchain >  pandas groupby: selecting unique latest entries
pandas groupby: selecting unique latest entries

Time:01-26

In the following pandas Data Frame:

    Name    v   date_modified
0   A   0   2023-01-01
1   A   1   2023-01-02
2   A   2   2023-01-03
3   B   0   2023-01-30
4   B   1   2023-01-02
5   B   2   2023-01-03
6   C   0   2023-01-30
7   C   1   2023-01-03
8   C   2   2023-01-03

How can I get two latest versions with most recent unique date_modified per group ['Name', 'v']?

In this example there are duplicates date_modified on df.Name == C. So far I tired to do something like this: df.sort_values('date_modified').groupby(['Name', 'v']).tail(2). This does not omit duplicates on date_modified and also for some reason return all rows not just tail of two

CodePudding user response:

IIUC, you have to drop some duplicates before:

>>> (df.drop_duplicates(['Name', 'date_modified'], keep='first')
       .sort_values('date_modified').groupby('Name').tail(2).sort_index())

  Name  v date_modified
1    A  1    2023-01-02
2    A  2    2023-01-03
3    B  0    2023-01-30
5    B  2    2023-01-03
6    C  0    2023-01-30
7    C  1    2023-01-03
  • Related