how to do lookup on two pandas dataframe and update its value in first dataframe column from another-CodePudding

I have 2 dataframes df and df1 -

df-

|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26 00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42 00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44 00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000 00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000 00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000 00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000 00:00|

df1-

|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|

Type of columns in df -

system_time  - object
status - object
id    - int64
date   - object

I want to apply a lookup here from df into df1. If id matches in df and df1,status in df should be of task_status from df1. As there are duplicate records in df, need to get the latest record and update the status as of df1 else keep status same as df for unmatched id's. In df, I have converted the system_time into date column using -

df['date']=pd.to_datetime(df['system_time'])

Expected output -

|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42 00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000 00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000 00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000 00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000 00:00|

CodePudding user response：

here is one way to do it using map

# map the df1 status to df when ID is found
df['status']=df['i'].map(df1.set_index(['id'])['task_status'])
df

    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26 00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42 00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44 00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000 00:00
4   2022-08-14T06:04:54.736Z    NaN     1053    2022-08-14 06:04:54.736000 00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000 00:00
6   2022-08-24T06:24:54.736Z    NaN     999     2022-08-24 06:24:54.736000 00:00

Alternately, if you like to update only when status is found in DF1

df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()), 
                           (df['id'].map(df1.set_index(['id'])['task_status'])) )
df

    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26 00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42 00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44 00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000 00:00
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000 00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000 00:00
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000 00:00

for unmatched,

#update 'unmatched' column as unmatched when id is NOT found. 
#When its found, keep the status as-is.
#you may want to keep the previous one and this one together

df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()), 
                           'unmatched' )
df

system_time     status  id  date    unmatched
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26 00:00   Complete
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42 00:00   Pending
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44 00:00   Complete
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000 00:00    Complete
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000 00:00    unmatched
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000 00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000 00:00    unmatched

to keep the last row based on the system-time

df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')

system_time     status  id  date    unmatched
5   2022-03-04T17:51:15.025Z    Pending     772     2022-03-04 17:51:15.025000 00:00    Pending
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42 00:00   Pending
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000 00:00    unmatched
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000 00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000 00:00    unmatched

CodePudding user response：

You need to use merge:

df = df.merge(right=df1, on='id',how='left')