Append columns of pandas dataframe based on a regex

I have two dataframes that I would like to combine based on a regex. If the value in the 'code' column of df1 (e.g. R93.2) matches the 'ICD_CODE' of df2 (e.g. R93.1) on the part before the dot (here R93), the 'code' value should be added to that row of df2.

df1
code
R93.2
S03


df2
ICD_CODE    ICD_term                        MDR_code    MDR_term    
R93.1       Acute abdomen                   10000647    Acute abdomen   
K62.4       Stenosis of anus and rectum     10002581    Anorectal stenosis
S03.1       Hand-Schüller-Christian disease 10053135    Hand-Schueller-Christian disease

The expected output is:

code    ICD_CODE    ICD_term                        MDR_code    MDR_term    
R93.2   R93.1       Acute abdomen                   10000647    Acute abdomen   
S03     S03.1       Hand-Schüller-Christian disease 10053135    Hand-Schueller-Christian disease

Any help is highly appreciated!

CodePudding user response：

Use the part of each code before the dot as the merge key:

# Passing Series as keys makes merge add a temporary 'key_0' column, which is dropped afterwards
out = (df1.merge(df2, left_on=df1['code'].str.split('.').str[0],
                 right_on=df2['ICD_CODE'].str.split('.').str[0])
          .drop(columns='key_0'))
print(out)

# Output
    code ICD_CODE                         ICD_term  MDR_code                          MDR_term
0  R93.2    R93.1                    Acute abdomen  10000647                     Acute abdomen
1    S03    S03.1  Hand-Schüller-Christian disease  10053135  Hand-Schueller-Christian disease
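
Since the question asks for a regex, the same key can also be derived with str.extract instead of str.split. This is only a minimal sketch: it assumes the sample frames from the question (rebuilt by hand here) and that the prefix before the first dot is the intended key. The helper column name `key` and the variable `pattern` are illustrative, not part of the answer above.

```python
import pandas as pd

# Sample frames rebuilt by hand from the question.
df1 = pd.DataFrame({"code": ["R93.2", "S03"]})
df2 = pd.DataFrame({
    "ICD_CODE": ["R93.1", "K62.4", "S03.1"],
    "ICD_term": ["Acute abdomen", "Stenosis of anus and rectum",
                 "Hand-Schüller-Christian disease"],
    "MDR_code": [10000647, 10002581, 10053135],
    "MDR_term": ["Acute abdomen", "Anorectal stenosis",
                 "Hand-Schueller-Christian disease"],
})

# Regex: capture everything before the first dot
# (the whole code when it contains no dot).
pattern = r"^([^.]+)"
left = df1.assign(key=df1["code"].str.extract(pattern, expand=False))
right = df2.assign(key=df2["ICD_CODE"].str.extract(pattern, expand=False))

# how="inner" keeps only prefixes present on both sides; how="left" would
# keep unmatched df1 codes with NaN in the df2 columns.
out = left.merge(right, on="key", how="inner").drop(columns="key")
print(out)
```

If df2 holds several codes with the same prefix (say R93.1 and R93.5), this merge returns one row per pair, so a df1 code can appear more than once in the result.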

CodePudding user response：

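Another option is fuzzy matching with process.extractOne from fuzzywuzzy, which picks the closest ICD_CODE for each code. With the default settings it returns the best-scoring candidate even when no code is a true match.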

# pip install fuzzywuzzy
from fuzzywuzzy import process

# extractOne returns a tuple whose first element is the best-matching choice
out = (df1.assign(matched_code=df1["code"].apply(lambda x: process.extractOne(x, df2["ICD_CODE"])[0]))
          .merge(df2, left_on="matched_code", right_on="ICD_CODE")
          .drop(columns="matched_code")
       )

Output:

print(out)

    code ICD_CODE                         ICD_term  MDR_code                          MDR_term
0  R93.2    R93.1                    Acute abdomen  10000647                     Acute abdomen
1    S03    S03.1  Hand-Schüller-Christian disease  10053135  Hand-Schueller-Christian disease
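
A fuzzy matcher can pair a code with an unrelated entry when nothing similar exists, so a similarity threshold is worth considering. The sketch below swaps in difflib.get_close_matches from the standard library instead of fuzzywuzzy, purely to stay dependency-free. The helper `closest_code`, the 0.6 cutoff, and the extra code "Z99" are assumptions made for illustration.

```python
import difflib
import pandas as pd

# "Z99" is a made-up code with no counterpart in df2.
df1 = pd.DataFrame({"code": ["R93.2", "S03", "Z99"]})
df2 = pd.DataFrame({
    "ICD_CODE": ["R93.1", "K62.4", "S03.1"],
    "MDR_code": [10000647, 10002581, 10053135],
})

choices = df2["ICD_CODE"].tolist()

def closest_code(code, cutoff=0.6):
    """Return the most similar ICD code, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(code, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

matched = df1.assign(matched_code=df1["code"].map(closest_code))

# Codes without a sufficiently similar ICD_CODE are dropped before merging.
out = (matched.dropna(subset=["matched_code"])
              .merge(df2, left_on="matched_code", right_on="ICD_CODE")
              .drop(columns="matched_code"))
print(out)
```

The cutoff is a judgment call: a lower value accepts looser matches, a higher one rejects more codes. For ICD codes with a fixed prefix structure, the exact prefix merge from the first answer is usually the more predictable choice.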