Merging DataFrames with Partial String Matches

I am trying to merge two dataframes of different sizes based on a partial match between the columns 'name' and 'sub_name' (and a full match of the values in column A):

  sub_name  val_1    A
0      AAA      2  208
1      AAB      4  208
2      AAC      8  208
3      BAA      7  210
4      CAA      4  213
5      CAC      6  213
6      CAD      2  213
7      CAE      3  213
8      EAA      8  222
9      FAA      3  223

    name  val_2    A
0   XAAA      1  208
1  AABYY      5  208
2   BAAZ      9  210
3   CAAY      5  213
4  YCABX      8  213
5  XXCAC      6  213
6  YCADZ      3  213
7   XDAA      6  215
8   EAAX      4  222

The code:

import pandas as pd

df1 = pd.DataFrame({'sub_name': ['AAA','AAB','AAC','BAA','CAA','CAC','CAD','CAE','EAA', 'FAA'],
'val_1': [2,4,8,7,4,6,2,3,8,3], 
'A':[208,208,208,210,213,213,213,213,222,223]})

df2 = pd.DataFrame({'name': ['XAAA','AABYY','BAAZ','CAAY','YCABX','XXCAC','YCADZ','XDAA','EAAX'], 
'val_2': [1,5,9,5,8,6,3,6,4], 
'A': [208,208,210,213,213,213,213,215,222]})

Edit: I want to do an outer merge of these two dataframes. If a row has no match, keep it. If there is a partial match (sub_name is contained in name) and the values in column A also match, merge the rows together. If there is a partial match between name and sub_name but the column A values don't match, keep both rows.

I am trying to obtain:

     name  val_1  val_2    A
0     AAA    2.0    1.0  208
1     AAB    4.0    5.0  208
2     AAC    8.0    NaN  208
3     BAA    7.0    9.0  210
4     CAA    4.0    5.0  213
5   YCABX    NaN    8.0  213
6     CAC    6.0    6.0  213
7     CAD    2.0    3.0  213
8     CAE    3.0    NaN  213
9    XDAA    NaN    6.0  215
10    EAA    8.0    4.0  222
11    FAA    3.0    NaN  223

or this (it doesn't matter if I keep the full name or just the sub_name where the rows match):

     name  val_1  val_2    A
0    XAAA    2.0    1.0  208
1   AABYY    4.0    5.0  208
2     AAC    8.0    NaN  208
3    BAAZ    7.0    9.0  210
4    CAAY    4.0    5.0  213
5   YCABX    NaN    8.0  213
6   XXCAC    6.0    6.0  213
7   YCADZ    2.0    3.0  213
8     CAE    3.0    NaN  213
9    XDAA    NaN    6.0  215
10   EAAX    8.0    4.0  222
11    FAA    3.0    NaN  223

If I needed a full match I would use pd.merge(df1, df2, how='outer'), but since I am working with substrings I don't know how to approach this. Maybe str.contains() could be useful?

Note: The sub_name can be made of more than just three letters. This is just an example.

CodePudding user response：

You can use a fuzzy match with a threshold, then merge:

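from thefuzz import process

# Candidate names as a plain list, so extractOne returns a (match, score) pair
choices = df2['name'].tolist()

def best(x, thresh=80):
    # Return the best-scoring name for x, or None if it scores below thresh
    match, score = process.extractOne(x, choices)
    if score >= thresh:
        return match
    return None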


df1.merge(df2, left_on=['A', df1['sub_name'].apply(best)],
          right_on=['A', 'name'],
          how='outer')

Output:

   sub_name  val_1    A   name  val_2
0       AAA    2.0  208   XAAA    1.0
1       AAB    4.0  208  AABYY    5.0
2       AAC    8.0  208   None    NaN
3       BAA    7.0  210   BAAZ    9.0
4       CAA    4.0  213   CAAY    5.0
5       CAC    6.0  213  XXCAC    6.0
6       CAD    2.0  213  YCADZ    3.0
7       CAE    3.0  213   None    NaN
8       EAA    8.0  222   EAAX    4.0
9       FAA    3.0  223   None    NaN
10      NaN    NaN  213  YCABX    8.0
11      NaN    NaN  215   XDAA    6.0
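
Note that the fuzzy lookup above picks the best candidate from all of df2['name'] without looking at A, and the threshold is a heuristic. If a literal substring test is enough, as it is for the example data, a stricter variant can be sketched with pandas alone. This is only a sketch under assumptions: the names pairs, links and result are made up for illustration, and it assumes each sub_name is contained in at most one name per value of A (otherwise the df1 row is duplicated, once per match).

```python
import pandas as pd

df1 = pd.DataFrame({
    'sub_name': ['AAA', 'AAB', 'AAC', 'BAA', 'CAA', 'CAC', 'CAD', 'CAE', 'EAA', 'FAA'],
    'val_1': [2, 4, 8, 7, 4, 6, 2, 3, 8, 3],
    'A': [208, 208, 208, 210, 213, 213, 213, 213, 222, 223],
})
df2 = pd.DataFrame({
    'name': ['XAAA', 'AABYY', 'BAAZ', 'CAAY', 'YCABX', 'XXCAC', 'YCADZ', 'XDAA', 'EAAX'],
    'val_2': [1, 5, 9, 5, 8, 6, 3, 6, 4],
    'A': [208, 208, 210, 213, 213, 213, 213, 215, 222],
})

# Candidate pairs: every df1 row against every df2 row with the same A.
pairs = df1.merge(df2, on='A')

# Keep only the pairs where sub_name occurs literally inside name.
is_sub = [sub in name for sub, name in zip(pairs['sub_name'], pairs['name'])]
links = pairs.loc[is_sub, ['sub_name', 'name', 'A']]

# Attach the matched name to df1, then outer-merge with df2 so that
# unmatched rows from both frames are kept.
result = (
    df1.merge(links, on=['sub_name', 'A'], how='left')
       .merge(df2, on=['name', 'A'], how='outer')
)

# Use sub_name as the label for df1 rows that found no partner.
result['name'] = result['name'].fillna(result['sub_name'])
result = result.drop(columns='sub_name')
print(result)
```

The intermediate merge on A can get large when many rows share the same A, since it builds every pair within each group before filtering.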

CodePudding user response：

# Create a new column in df2: the df1 sub_name found inside df2's name
# (the sub_names are joined into a regex alternation, so they must not contain regex metacharacters)

df2['sub_name'] = df2['name'].str.findall('|'.join(df1['sub_name'].tolist())).str.join(',')

# Do an outer merge on the extracted sub_name and on A

df_new = df2.merge(df1, how='outer', on=['sub_name', 'A'])

# Fill missing names from sub_name, then drop the helper sub_name column

df_new = df_new.assign(name=df_new['name'].combine_first(df_new['sub_name'])).drop(columns=['sub_name'])

Output:

     name  val_2    A  val_1
0    XAAA    1.0  208    2.0
1   AABYY    5.0  208    4.0
2    BAAZ    9.0  210    7.0
3    CAAY    5.0  213    4.0
4   YCABX    8.0  213    NaN
5   XXCAC    6.0  213    6.0
6   YCADZ    3.0  213    2.0
7    XDAA    6.0  215    NaN
8    EAAX    4.0  222    8.0
9     AAC    NaN  208    8.0
10    CAE    NaN  213    3.0
11    FAA    NaN  223    3.0
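
Since the question notes that real sub_names can be longer than three letters, two details of the regex approach may matter: a sub_name containing a character such as + or . would be read as regex syntax, and a short sub_name could match before a longer one that starts the same way. A minimal sketch of one way to handle both, using made-up data (the sub_names and names below are hypothetical, not from the question):

```python
import re
import pandas as pd

# Hypothetical data: one sub_name contains a regex metacharacter,
# and 'A+B' is a prefix of 'A+BC'.
sub_names = pd.Series(['A+B', 'A+BC', 'XYZ'])
names = pd.Series(['QA+BCQ', 'xxXYZ', 'none'])

# Escape each sub_name and put longer ones first, so that the
# alternation prefers 'A+BC' over 'A+B' at the same position.
ordered = sorted(sub_names, key=len, reverse=True)
pattern = '|'.join(re.escape(s) for s in ordered)

# str.extract keeps the first match per row and gives NaN when
# nothing matches.
extracted = names.str.extract(f'({pattern})', expand=False)
print(extracted.tolist())
```

Taking only the first match is a design choice here: with findall plus join(','), a name that contains several sub_names yields a comma-joined key that will not equal any single sub_name in the merge.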