I have 2 DFs that I want to merge. but I need to merge them based on a string contains and also using multiple columns
df_1
IN Start_Time Description Per_Extr
0 IN7305517 2022-07-24 00:06:59 ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB_STREAM_NAME P26_BRAI_RS2... FROM : 2022/01/08 TO : 2022/12/09
1 IN7305465 2022-07-24 00:09:49 ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB_STREAM_NAME P26_AAAR_006_TSA... FROM : 2022/01/08 TO : 2022/12/09
2 IN7305466 2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA... FROM : 2022/01/08 TO : 2022/12/09
3 IN7305493 2022-07-24 00:20:27 ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB_STREAM_NAME P26_BGDTDCHF_PUM... FROM : 2022/01/08 TO : 2022/12/09
df_2
JOB_STREAM_NAME JOB_NAME
NaN P26_BRAI_RS2 PP_BRAI_VAR_CARTAO_IND_IBI_D
NaN P26_BRAI_VAR_TOD PP_BRAI_VAR_CARTAO_IND_IBI_D
NaN P26_AAAR_006_TSA PP_AAAR_4898_POUP_MOV_TDCH_D
NaN P26_AAAR_006_TSA PP_AAAR_4898_POUPMOV_D
NaN P26_BGDTDCHF_PUM PP_BGDTPRODHBACMS102020_01_M
The description column has the JOB_NAME and JOB_STREAM_NAME in it
What I'm aiming is a df like this: merged_df
IN JOB_STREAM_NAME JOB_NAME Start_Time Description Per_Extr
0 IN7305517 P26_BRAI_RS2 PP_BRAI_VAR_CARTAO_IND_IBI_D 2022-07-24 00:06:59 ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB_STREAM_NAME P26_BRAI_RS2... FROM : 2022/01/08 TO : 2022/12/09
1 NaN P26_BRAI_VAR_TOD PP_BRAI_VAR_CARTAO_IND_IBI_D NaN NaN NaN
2 IN7305465 P26_AAAR_006_TSA PP_AAAR_4898_POUP_MOV_TDCH_D 2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA... FROM : 2022/01/08 TO : 2022/12/09
3 IN7305466 P26_AAAR_006_TSA PP_AAAR_4898_POUPMOV_D 2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA... FROM : 2022/01/08 TO : 2022/12/09
4 IN7305493 P26_AAAR_006_TSA PP_AAAR_4898_POUPMOV_D 2022-07-24 00:20:27 ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB_STREAM_NAME P26_BGDTDCHF_PUM... FROM : 2022/01/08 TO : 2022/12/09
Notice that the job PP_BRAI_VAR_CARTAO_IND_IBI_D is in 2 JOB_STREAM_NAME and has no INs for one of them, that's why in the merged_df it appears without IN(NaN) for the one in the JOB_STREAM_NAME = P26_BRAI_VAR_TOD
I was instructed to do that with one column, but, not managing doing the same for multiple columns.
For one column I'm using this approach:
jobs_list= "|".join(map(str, df_2['JOB_NAME']))
new_df.insert(0, 'merge_key', df_1['Description'].str.extract("(" jobs_list ")", expand=False))
df_merged = new_df.merge(df_1, how='right', left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1)
could you guys help me?
CodePudding user response:
you would need a key to merge the two, so we extract the keys and use them to merge.
# extract the keys from the description and create addl columns
# you can always drop these afterwards
df[['JOB_NAME', 'JOB_STREAM_NAME' ]]=df['Description'].str.extract(r'JOB\s\b(\w )\b.*?JOB_STREAM_NAME\s\b(\w )\b' )
#merge on steam_name and job_name, since columns names are common, these won't be repeated
df3=df2.merge(df, on=['JOB_STREAM_NAME','JOB_NAME'], how='left')
df3
# drop the addl columns
df=df.drop(columns=['JOB_STREAM_NAME','JOB_NAME'])
JOB_STREAM_NAME JOB_NAME IN Start_Time Description Per_Extr
0 P26_BRAI_RS2 PP_BRAI_VAR_CARTAO_IND_IBI_D IN7305517 2022-07-24 00:06:59 ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB... FROM : 2022/01/08 TO : 2022/12/09
1 P26_BRAI_VAR_TOD PP_BRAI_VAR_CARTAO_IND_IBI_D NaN NaN NaN NaN
2 P26_AAAR_006_TSA PP_AAAR_4898_POUP_MOV_TDCH_D IN7305465 2022-07-24 00:09:49 ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB... FROM : 2022/01/08 TO : 2022/12/09
3 P26_AAAR_006_TSA PP_AAAR_4898_POUPMOV_D IN7305466 2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREA... FROM : 2022/01/08 TO : 2022/12/09
4 P26_BGDTDCHF_PUM PP_BGDTPRODHBACMS102020_01_M IN7305493 2022-07-24 00:20:27 ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB... FROM : 2022/01/08 TO : 2022/12/09
(r'JOB\s : match the literal JOB followed by \s (whitespace)
\b : word boundary
(\w )\b : capture one or more letters followed by word boundary (that will be your jobid)
.*? : match one or letters (non greedy)
JOB_STREAM_NAME\s\b : match the literal followed by whitespace, followed by word boundary
(\w )\b : capture one or more word characters followed by word boundary
' )