How to group by first column, select latest value of second column, and all respective values of third column

Time: 12-17

I have a df:

    {'ID': {0: 'A',
  1: 'A',
  2: 'A',
  3: 'B',
  4: 'B',
  5: 'B',
  6: 'C',
  7: 'C',
  8: 'C',
  9: 'C'},
 'Date': {0: Timestamp('2020-03-02 00:00:00'),
  1: Timestamp('2021-04-03 00:00:00'),
  2: Timestamp('2021-04-03 00:00:00'),
  3: Timestamp('2022-06-05 00:00:00'),
  4: Timestamp('2022-06-05 00:00:00'),
  5: Timestamp('2021-04-03 00:00:00'),
  6: Timestamp('2019-03-02 00:00:00'),
  7: Timestamp('2019-03-02 00:00:00'),
  8: Timestamp('2017-06-01 00:00:00'),
  9: Timestamp('2017-06-01 00:00:00')},
 'Check': {0: 'In',
  1: 'In',
  2: 'Out',
  3: 'Out',
  4: 'Out',
  5: 'In',
  6: 'In ',
  7: 'Out',
  8: 'Out',
  9: 'Out'}}
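For anyone reproducing this: the dict above uses `Timestamp`, which is `pandas.Timestamp`, so it can be turned back into a frame with `pd.DataFrame` (a minimal sketch; the trailing space in `'In '` for row 6 is in the original data):

```python
import pandas as pd
from pandas import Timestamp

data = {
    'ID': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B',
           6: 'C', 7: 'C', 8: 'C', 9: 'C'},
    'Date': {0: Timestamp('2020-03-02'), 1: Timestamp('2021-04-03'),
             2: Timestamp('2021-04-03'), 3: Timestamp('2022-06-05'),
             4: Timestamp('2022-06-05'), 5: Timestamp('2021-04-03'),
             6: Timestamp('2019-03-02'), 7: Timestamp('2019-03-02'),
             8: Timestamp('2017-06-01'), 9: Timestamp('2017-06-01')},
    'Check': {0: 'In', 1: 'In', 2: 'Out', 3: 'Out', 4: 'Out',
              5: 'In', 6: 'In ', 7: 'Out', 8: 'Out', 9: 'Out'},
}
df = pd.DataFrame(data)
```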

For each ID, I want to get the latest Date and all values of Check related to that latest date.
My expected result:

    {'ID': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C'},
 'Date': {0: Timestamp('2021-04-03 00:00:00'),
  1: Timestamp('2021-04-03 00:00:00'),
  2: Timestamp('2022-06-05 00:00:00'),
  3: Timestamp('2019-03-02 00:00:00'),
  4: Timestamp('2019-03-02 00:00:00')},
 'Check': {0: 'In', 1: 'Out', 2: 'Out', 3: 'In ', 4: 'Out'}}

I tried the script below, but it keeps only the first Check value per ID at the latest date:

    new_df = (
        df.sort_values(["ID", "Date"], ascending=[False, False])
        .drop_duplicates(subset=["ID"])
        .reset_index(drop=True)
    )

    new_df

    ID  Date        Check
0   C   2019-03-02  In
1   B   2022-06-05  Out
2   A   2021-04-03  In

Any suggestions?

CodePudding user response:

One option is a groupby combined with a merge: compute the latest Date per ID, then keep every (deduplicated) row whose ID and Date match it. Note the question's frame is named `df`, so that name is used here:

    (df
     .drop_duplicates()
     .merge(df.groupby('ID').Date.max(),
            on=['ID', 'Date'])
    )

      ID       Date Check
    0  A 2021-04-03    In
    1  A 2021-04-03   Out
    2  B 2022-06-05   Out
    3  C 2019-03-02   In
    4  C 2019-03-02   Out
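An equivalent approach that avoids the merge is a boolean mask built with `groupby(...).transform('max')`, which broadcasts each ID's latest Date back onto every row (a sketch that rebuilds the sample data inline):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Date': pd.to_datetime(['2020-03-02', '2021-04-03', '2021-04-03',
                            '2022-06-05', '2022-06-05', '2021-04-03',
                            '2019-03-02', '2019-03-02', '2017-06-01',
                            '2017-06-01']),
    'Check': ['In', 'In', 'Out', 'Out', 'Out',
              'In', 'In ', 'Out', 'Out', 'Out'],
})

# True where the row's Date equals the per-ID maximum
mask = df['Date'].eq(df.groupby('ID')['Date'].transform('max'))

# Keep those rows, drop exact duplicate rows (B has two identical ones)
out = df[mask].drop_duplicates().reset_index(drop=True)
```

`transform('max')` returns a Series aligned to the original index, so no join keys are needed; the final `drop_duplicates()` matches the merge-based answer, which also deduplicates.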