Home > Software design >  Pandas: str contains using another series
Pandas: str contains using another series

Time:10-16

Excellent answers exist already that make use of exact string matches:

String contains across two pandas series

But how might I use substring matching with str.contains leveraging another series in my data frame?

Consider this example -- how can I achieve the expected result or at least make use of substring matching, str.contains?

df = pd.DataFrame({'LName': ['AT-D81CGG3-RED-STORAGE-GEN4.2', 'AT-Z111AC5-RED-STORAGEUTILITY-GEN4.2'],
                   'RName': ['AT_Utility_Gen4_2|AT_Storage_Gen4_2', 'ATAT_Utility_Gen4_2|AT_Storage_Gen4_2'],
                  'Expected': ['AT_Storage_Gen4_2', 'AT_Utility_Gen4_2']})
# exact search term matching which is not helpful
df['RName'].str.split('|', expand=True).eq(df['LName'], axis=0).any(1).astype(np.int8)

CodePudding user response:

You can use Levenshtein distance as fuzzy behavior:

!pip install python-Levenshtein

import Levenshtein as lev

def get_fuzzy_match(row):
  rnames = row["RName"].split('|')
  tuples = list(itertools.product([row["LName"]], rnames))
  min_idx = np.argmin(list(itertools.starmap(lev.distance, tuples)))
  return rnames[min_idx]

df.apply(lambda row: get_fuzzy_match(row), axis=1)

>> 0      AT_Storage_Gen4_2
>> 1    ATAT_Utility_Gen4_2
  • Related