I have two dataframes that I would like to merge based on a regex-style partial match. If a value in the 'code' column of df1 (e.g. R93.2) shares its prefix (e.g. R93) with an 'ICD_CODE' value in df2 (e.g. R93.1), the 'code' value should be added to the matching row of df2.
df1
code
R93.2
S03
df2
ICD_CODE ICD_term MDR_code MDR_term
R93.1 Acute abdomen 10000647 Acute abdomen
K62.4 Stenosis of anus and rectum 10002581 Anorectal stenosis
S03.1 Hand-Schüller-Christian disease 10053135 Hand-Schueller-Christian disease
The expected output is:
code ICD_CODE ICD_term MDR_code MDR_term
R93.2 R93.1 Acute abdomen 10000647 Acute abdomen
S03 S03.1 Hand-Schüller-Christian disease 10053135 Hand-Schueller-Christian disease
Any help is highly appreciated!
CodePudding user response:
Use the part before the dot in each code column as the merge key:
out = (df1.merge(df2, left_on=df1['code'].str.split('.').str[0],
                 right_on=df2['ICD_CODE'].str.split('.').str[0])
          .drop(columns='key_0'))
print(out)
# Output
code ICD_CODE ICD_term MDR_code MDR_term
0 R93.2 R93.1 Acute abdomen 10000647 Acute abdomen
1 S03 S03.1 Hand-Schüller-Christian disease 10053135 Hand-Schueller-Christian disease
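For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same prefix-merge idea using the sample data from the question. The helper column name `prefix` is an assumption made for illustration (it is not in the original answer); it simply makes the merge key explicit instead of relying on the auto-generated `key_0` column.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"code": ["R93.2", "S03"]})
df2 = pd.DataFrame({
    "ICD_CODE": ["R93.1", "K62.4", "S03.1"],
    "ICD_term": ["Acute abdomen", "Stenosis of anus and rectum",
                 "Hand-Schüller-Christian disease"],
    "MDR_code": [10000647, 10002581, 10053135],
    "MDR_term": ["Acute abdomen", "Anorectal stenosis",
                 "Hand-Schueller-Christian disease"],
})

# "prefix" is a hypothetical helper column: the part of each code before the first dot.
left = df1.assign(prefix=df1["code"].str.split(".").str[0])
right = df2.assign(prefix=df2["ICD_CODE"].str.split(".").str[0])

# Inner merge keeps only the rows whose prefixes match, then the helper is dropped.
out = left.merge(right, on="prefix", how="inner").drop(columns="prefix")
print(out)
```

If the rows of df1 without a counterpart in df2 should be kept as well, passing `how="left"` instead of `how="inner"` should do that, leaving the df2 columns empty for those rows.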
CodePudding user response:
A possible solution is to use process.extractOne from fuzzywuzzy:
#pip install fuzzywuzzy
from fuzzywuzzy import process
out = (
    df1.assign(matched_code=df1["code"].apply(lambda x: process.extractOne(x, df2["ICD_CODE"])[0]))
       .merge(df2, left_on="matched_code", right_on="ICD_CODE")
       .drop(columns="matched_code")
)
print(out)
# Output
code ICD_CODE ICD_term MDR_code MDR_term
0 R93.2 R93.1 Acute abdomen 10000647 Acute abdomen
1 S03 S03.1 Hand-Schüller-Christian disease 10053135 Hand-Schueller-Christian disease
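One thing to keep in mind with fuzzy matching: with its default settings, extractOne returns the best-scoring candidate even when that candidate is a poor match, so a code with no real counterpart would still be paired with something. The sketch below illustrates the idea of a similarity cutoff. Note that it swaps in `difflib.get_close_matches` from the standard library rather than fuzzywuzzy, and that the helper `closest_code`, the cutoff of 0.7, and the extra code `Z99` are all assumptions made up for this illustration.

```python
import difflib

import pandas as pd

# Sample data from the question, plus "Z99", a made-up code with no counterpart in df2.
df1 = pd.DataFrame({"code": ["R93.2", "S03", "Z99"]})
df2 = pd.DataFrame({
    "ICD_CODE": ["R93.1", "K62.4", "S03.1"],
    "ICD_term": ["Acute abdomen", "Stenosis of anus and rectum",
                 "Hand-Schüller-Christian disease"],
})


def closest_code(code, choices, cutoff=0.7):
    """Hypothetical helper: most similar choice, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(code, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


choices = df2["ICD_CODE"].tolist()
matched = df1.assign(matched_code=df1["code"].map(lambda c: closest_code(c, choices)))

# Drop codes that found no sufficiently similar ICD_CODE, then merge as before.
out = (
    matched.dropna(subset=["matched_code"])
           .merge(df2, left_on="matched_code", right_on="ICD_CODE")
           .drop(columns="matched_code")
)
print(out)
```

The cutoff value is a judgment call and would need tuning on real data. For codes that follow a strict prefix pattern like these, an exact merge on the prefix is likely the more predictable choice.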