I have a df:
{'ID': {0: 'A',
1: 'A',
2: 'A',
3: 'B',
4: 'B',
5: 'B',
6: 'C',
7: 'C',
8: 'C',
9: 'C'},
'Date': {0: Timestamp('2020-03-02 00:00:00'),
1: Timestamp('2021-04-03 00:00:00'),
2: Timestamp('2021-04-03 00:00:00'),
3: Timestamp('2022-06-05 00:00:00'),
4: Timestamp('2022-06-05 00:00:00'),
5: Timestamp('2021-04-03 00:00:00'),
6: Timestamp('2019-03-02 00:00:00'),
7: Timestamp('2019-03-02 00:00:00'),
8: Timestamp('2017-06-01 00:00:00'),
9: Timestamp('2017-06-01 00:00:00')},
'Check': {0: 'In',
1: 'In',
2: 'Out',
3: 'Out',
4: 'Out',
5: 'In',
6: 'In ',
7: 'Out',
8: 'Out',
9: 'Out'}}
For each ID, I want to get the latest Date and all values of Check related to that latest date.
My expected result:
{'ID': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C'},
'Date': {0: Timestamp('2021-04-03 00:00:00'),
1: Timestamp('2021-04-03 00:00:00'),
2: Timestamp('2022-06-05 00:00:00'),
3: Timestamp('2019-03-02 00:00:00'),
4: Timestamp('2019-03-02 00:00:00')},
'Check': {0: 'In', 1: 'Out', 2: 'Out', 3: 'In ', 4: 'Out'}}
I tried the below script and got the latest date with only first value of Check for each ID:
new_df= (
df.sort_values(["ID", "Date"], ascending=[False, False])
.drop_duplicates(subset = ["ID"])
.reset_index(drop=True)
)
new_df
ID Date Check
0 C 2019-03-02 In
1 B 2022-06-05 Out
2 A 2021-04-03 In
Any suggestions?
CodePudding user response:
one option is with a groupby:
(df1
.drop_duplicates()
.merge(
df1.groupby('ID').Date.max(),
on = ['ID', 'Date'])
)
ID Date Check
0 A 2021-04-03 In
1 A 2021-04-03 Out
2 B 2022-06-05 Out
3 C 2019-03-02 In
4 C 2019-03-02 Out