I am trying to merge two dataframes of different sizes based on a partial match between the columns 'sub_name' and 'name' (and a full match of the values in column A):
  sub_name  val_1    A
0      AAA      2  208
1      AAB      4  208
2      AAC      8  208
3      BAA      7  210
4      CAA      4  213
5      CAC      6  213
6      CAD      2  213
7      CAE      3  213
8      EAA      8  222
9      FAA      3  223
    name  val_2    A
0   XAAA      1  208
1  AABYY      5  208
2   BAAZ      9  210
3   CAAY      5  213
4  YCABX      8  213
5  XXCAC      6  213
6  YCADZ      3  213
7   XDAA      6  215
8   EAAX      4  222
The code:
import pandas as pd

df1 = pd.DataFrame({'sub_name': ['AAA', 'AAB', 'AAC', 'BAA', 'CAA', 'CAC', 'CAD', 'CAE', 'EAA', 'FAA'],
                    'val_1': [2, 4, 8, 7, 4, 6, 2, 3, 8, 3],
                    'A': [208, 208, 208, 210, 213, 213, 213, 213, 222, 223]})
df2 = pd.DataFrame({'name': ['XAAA', 'AABYY', 'BAAZ', 'CAAY', 'YCABX', 'XXCAC', 'YCADZ', 'XDAA', 'EAAX'],
                    'val_2': [1, 5, 9, 5, 8, 6, 3, 6, 4],
                    'A': [208, 208, 210, 213, 213, 213, 213, 215, 222]})
Edit: I want an outer merge of these two dataframes. If there is no match, keep the rows. If there is a partial match (between sub_name and name) and the values in column A also match, merge the rows together. If there is a partial match between name and sub_name but the column A values don't match, keep both rows.
I am trying to obtain:
     name  val_1  val_2    A
0     AAA    2.0    1.0  208
1     AAB    4.0    5.0  208
2     AAC    8.0    NaN  208
3     BAA    7.0    9.0  210
4     CAA    4.0    5.0  213
5   YCABX    NaN    8.0  213
6     CAC    6.0    6.0  213
7     CAD    2.0    3.0  213
8     CAE    3.0    NaN  213
9    XDAA    NaN    6.0  215
10    EAA    8.0    4.0  222
11    FAA    3.0    NaN  223
Or this (it doesn't matter whether I keep the full name or just the sub_name where the rows match):
     name  val_1  val_2    A
0    XAAA    2.0    1.0  208
1   AABYY    4.0    5.0  208
2     AAC    8.0    NaN  208
3    BAAZ    7.0    9.0  210
4    CAAY    4.0    5.0  213
5   YCABX    NaN    8.0  213
6   XXCAC    6.0    6.0  213
7   YCADZ    2.0    3.0  213
8     CAE    3.0    NaN  213
9    XDAA    NaN    6.0  215
10    EAA    8.0    4.0  222
11    FAA    3.0    NaN  223
If I needed a full match I would use pd.merge(df1, df2, how='outer'), but since I am working with substrings I don't know how to approach this. Maybe str.contains() could be useful?
Note: The sub_name can be made of more than just three letters. This is just an example.
CodePudding user response:
You can use a fuzzy match with a threshold, then merge:
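from thefuzz import process

def best(x, thresh=80):
    # A plain list of choices makes extractOne return a (match, score) pair;
    # a Series may be treated as a mapping and return an extra key element.
    match, score = process.extractOne(x, df2['name'].tolist())
    # Keep the best match only if it clears the threshold, otherwise None
    return match if score >= thresh else None

df1.merge(df2, left_on=['A', df1['sub_name'].apply(best)],
          right_on=['A', 'name'],
          how='outer')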
Output:
    sub_name  val_1    A   name  val_2
0        AAA    2.0  208   XAAA    1.0
1        AAB    4.0  208  AABYY    5.0
2        AAC    8.0  208   None    NaN
3        BAA    7.0  210   BAAZ    9.0
4        CAA    4.0  213   CAAY    5.0
5        CAC    6.0  213  XXCAC    6.0
6        CAD    2.0  213  YCADZ    3.0
7        CAE    3.0  213   None    NaN
8        EAA    8.0  222   EAAX    4.0
9        FAA    3.0  223   None    NaN
10       NaN    NaN  213  YCABX    8.0
11       NaN    NaN  215   XDAA    6.0
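If a single name column is wanted, as in the question's expected output, the sub_name and name columns of the merged frame can be collapsed afterwards. A minimal sketch, assuming a frame shaped like the output above; the variable names `merged` and `result` and the four hand-copied rows are only for illustration:

```python
import pandas as pd

# A few rows shaped like the merged output above (values copied from it).
merged = pd.DataFrame({
    'sub_name': ['AAA', 'AAC', None, None],
    'val_1': [2.0, 8.0, None, None],
    'A': [208, 208, 213, 215],
    'name': ['XAAA', None, 'YCABX', 'XDAA'],
    'val_2': [1.0, None, 8.0, 6.0],
})

# Prefer sub_name where it exists, otherwise fall back to the full name.
merged['name'] = merged['sub_name'].combine_first(merged['name'])

# Drop the helper column, order rows by A, and arrange columns as in the question.
result = (merged.drop(columns='sub_name')
                .sort_values('A', kind='stable')
                .reset_index(drop=True)
                [['name', 'val_1', 'val_2', 'A']])
print(result)
```

Swapping the two columns in the combine_first call would keep the full name instead of the sub_name for matched rows.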
CodePudding user response:
# Create a new column in df2 holding the df1 sub_name found inside each name
df2['sub_name'] = df2['name'].str.findall('|'.join(df1['sub_name'].tolist())).str.join(',')
# Outer merge on the extracted sub_name and on A
df_new = df2.merge(df1, how='outer', on=['sub_name', 'A'])
# Fill missing names from sub_name, then drop the helper sub_name column
df_new = df_new.assign(name=df_new['name'].combine_first(df_new['sub_name'])).drop(columns=['sub_name'])
Output:
     name  val_2    A  val_1
0    XAAA    1.0  208    2.0
1   AABYY    5.0  208    4.0
2    BAAZ    9.0  210    7.0
3    CAAY    5.0  213    4.0
4   YCABX    8.0  213    NaN
5   XXCAC    6.0  213    6.0
6   YCADZ    3.0  213    2.0
7    XDAA    6.0  215    NaN
8    EAAX    4.0  222    8.0
9     AAC    NaN  208    8.0
10    CAE    NaN  213    3.0
11    FAA    NaN  223    3.0
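Joining the sub_name values with '|' builds a regular expression, so any sub_name containing regex metacharacters would be interpreted as a pattern rather than as literal text. A regex-free variant can be sketched by first pairing rows on A and then testing for plain substring containment. This is only one possible approach, not part of the answer above; the names `pairs`, `links` and `out` are made up for the example, and it assumes each sub_name matches at most one name within the same A (otherwise rows would be duplicated):

```python
import pandas as pd

df1 = pd.DataFrame({'sub_name': ['AAA', 'AAB', 'AAC', 'BAA', 'CAA', 'CAC', 'CAD', 'CAE', 'EAA', 'FAA'],
                    'val_1': [2, 4, 8, 7, 4, 6, 2, 3, 8, 3],
                    'A': [208, 208, 208, 210, 213, 213, 213, 213, 222, 223]})
df2 = pd.DataFrame({'name': ['XAAA', 'AABYY', 'BAAZ', 'CAAY', 'YCABX', 'XXCAC', 'YCADZ', 'XDAA', 'EAAX'],
                    'val_2': [1, 5, 9, 5, 8, 6, 3, 6, 4],
                    'A': [208, 208, 210, 213, 213, 213, 213, 215, 222]})

# Pair every df1 row with every df2 row that shares the same A value.
pairs = df1.merge(df2, on='A')

# Keep only the pairs where sub_name occurs literally inside name.
is_sub = [s in n for s, n in zip(pairs['sub_name'], pairs['name'])]
links = pairs.loc[is_sub, ['sub_name', 'name', 'A']]

# Attach the matched name to df1, then outer-merge with df2 on (name, A).
out = (df1.merge(links, on=['sub_name', 'A'], how='left')
          .merge(df2, on=['name', 'A'], how='outer'))

# Use sub_name as the name for df1 rows that found no partner.
out['name'] = out['name'].combine_first(out['sub_name'])
out = (out.drop(columns='sub_name')
          .sort_values('A', kind='stable')
          .reset_index(drop=True)
          [['name', 'val_1', 'val_2', 'A']])
print(out)
```

The intermediate merge on A grows with the number of rows sharing each A value, so this is better suited to data where the A groups are small.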