I have 2 dataframes df and df1 -
df-
|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26 00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42 00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44 00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000 00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000 00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000 00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000 00:00|
df1-
|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|
Type of columns in df -
system_time - object
status - object
id - int64
date - object
I want to apply a lookup here from df into df1. If id matches in df and df1,status in df should be of task_status from df1. As there are duplicate records in df, need to get the latest record and update the status as of df1 else keep status same as df for unmatched id's. In df, I have converted the system_time into date column using -
df['date']=pd.to_datetime(df['system_time'])
Expected output -
|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42 00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000 00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000 00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000 00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000 00:00|
CodePudding user response:
here is one way to do it using map
# map the df1 status to df when ID is found
df['status']=df['i'].map(df1.set_index(['id'])['task_status'])
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26 00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42 00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44 00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000 00:00
4 2022-08-14T06:04:54.736Z NaN 1053 2022-08-14 06:04:54.736000 00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000 00:00
6 2022-08-24T06:24:54.736Z NaN 999 2022-08-24 06:24:54.736000 00:00
Alternately, if you like to update only when status is found in DF1
df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()),
(df['id'].map(df1.set_index(['id'])['task_status'])) )
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26 00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42 00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44 00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000 00:00
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000 00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000 00:00
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000 00:00
for unmatched,
#update 'unmatched' column as unmatched when id is NOT found.
#When its found, keep the status as-is.
#you may want to keep the previous one and this one together
df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()),
'unmatched' )
df
system_time status id date unmatched
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26 00:00 Complete
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42 00:00 Pending
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44 00:00 Complete
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000 00:00 Complete
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000 00:00 unmatched
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000 00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000 00:00 unmatched
to keep the last row based on the system-time
df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')
system_time status id date unmatched
5 2022-03-04T17:51:15.025Z Pending 772 2022-03-04 17:51:15.025000 00:00 Pending
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42 00:00 Pending
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000 00:00 unmatched
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000 00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000 00:00 unmatched
CodePudding user response:
You need to use merge
:
df = df.merge(right=df1, on='id',how='left')