Excellent answers exist already that make use of exact string matches:
String contains across two pandas series
But how might I use substring matching with str.contains
leveraging another series in my data frame?
Consider this example -- how can I achieve the expected result or at least make use of substring matching, str.contains?
df = pd.DataFrame({'LName': ['AT-D81CGG3-RED-STORAGE-GEN4.2', 'AT-Z111AC5-RED-STORAGEUTILITY-GEN4.2'],
'RName': ['AT_Utility_Gen4_2|AT_Storage_Gen4_2', 'ATAT_Utility_Gen4_2|AT_Storage_Gen4_2'],
'Expected': ['AT_Storage_Gen4_2', 'AT_Utility_Gen4_2']})
# exact search term matching which is not helpful
df['RName'].str.split('|', expand=True).eq(df['LName'], axis=0).any(1).astype(np.int8)
CodePudding user response:
You can use Levenshtein distance as fuzzy behavior:
!pip install python-Levenshtein
import Levenshtein as lev
def get_fuzzy_match(row):
rnames = row["RName"].split('|')
tuples = list(itertools.product([row["LName"]], rnames))
min_idx = np.argmin(list(itertools.starmap(lev.distance, tuples)))
return rnames[min_idx]
df.apply(lambda row: get_fuzzy_match(row), axis=1)
>> 0 AT_Storage_Gen4_2
>> 1 ATAT_Utility_Gen4_2