I have 4 columns which are BuisnessID, Name, BuisnessID_y, Name_y and I want to match Name with Name_y with a 90% similarity score, and if not 90% then drop those rows. Sample input
df
BusinessID NAME BusinessID_y NAME_y
1013120869 MANOJ WANKHADE 1013404164 SLIMI
1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR
I am new to python and am not sure how to do this. Also, I have 500k records so any another approach other rapid-fuzz would be great
CodePudding user response:
>>> import pandas as pd
>>> import rapidfuzz
>>> df['matching_ratio'] = df.apply(lambda x:rapidfuzz.fuzz.ratio(x.NAME, x.NAME_y), axis=1).to_list()
>>> df
BusinessID NAME BusinessID_y NAME_y matching_ratio
0 1013120869 MANOJ WANKHADE 1013404164 SLIMI 10.526316
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
2 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
3 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677
>>> df[df.matching_ratio > 26] # change this '26' value to '90' as your requirmetn
BusinessID NAME BusinessID_y NAME_y matching_ratio
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677