In the following pandas Data Frame:
Name v date_modified
0 A 0 2023-01-01
1 A 1 2023-01-02
2 A 2 2023-01-03
3 B 0 2023-01-30
4 B 1 2023-01-02
5 B 2 2023-01-03
6 C 0 2023-01-30
7 C 1 2023-01-03
8 C 2 2023-01-03
How can I get two latest versions with most recent unique date_modified per group ['Name', 'v']?
In this example there are duplicates date_modified on df.Name == C
. So far I tired to do something like this:
df.sort_values('date_modified').groupby(['Name', 'v']).tail(2)
. This does not omit duplicates on date_modified and also for some reason return all rows not just tail of two
CodePudding user response:
IIUC, you have to drop some duplicates before:
>>> (df.drop_duplicates(['Name', 'date_modified'], keep='first')
.sort_values('date_modified').groupby('Name').tail(2).sort_index())
Name v date_modified
1 A 1 2023-01-02
2 A 2 2023-01-03
3 B 0 2023-01-30
5 B 2 2023-01-03
6 C 0 2023-01-30
7 C 1 2023-01-03