I'm working with two dataframes: one with abbreviations of pharmaceutical forms, and another with the complete names of those forms. I want to check whether every word in an abbreviation string appears at the start of a word in another string.
I have:
df1
abrev
'dis ijp'
'dis inf'
'dis inj'
I'm trying to associate those abbreviations with strings containing the complete version of those pharmaceutical forms:
df2
term
'Dispergovateľná tableta'
'Dispergovateľné tablety do dávkovacieho zariadenia'
'Disperzia na koncentrát na infúznu disperziu'
'Disperzia pre rozprašovač'
I tried using fuzzywuzzy, but it rarely matches the correct string: I have hundreds of them, so lowering the threshold produces wrong matches. Most abbreviations don't even have the right term in df2 to match with, as shown in the example.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_partial(df_1, df_2, key1, key2, threshold=90, limit=3):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    # For each abbreviation, collect the `limit` best-scoring candidate terms
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit, scorer=fuzz.partial_ratio))
    df_1['matches'] = m
    # Keep only candidates at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: '; '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

fuzzy_partial(df1, df2, 'abrev', 'term', threshold=50)
This code sample uses the partial_ratio scorer, but I tried all of the scorers. That's why I thought of matching each substring against the start of the words in the complete terms. This way, I would get:
df
abrev term
'dis inf' 'Disperzia na koncentrát na infúznu disperziu'
What would be the best way to do this?
CodePudding user response:
Concept:
For each abbreviation and each term, walk through the term's words in order and check whether the current word starts with the current piece of the abbreviation. If it does, count it and compare the following words against the next piece instead. If the final count equals the number of pieces in the abbreviation, the term contains all of them, so add the pair to the result list. Do this for every abbreviation and term, then return the result list to see all of the matched pairs.
Code:
abrevs = ['dis ijp', 'dis inf', 'dis inj']
terms = [
    'Dispergovateľná tableta',
    'Dispergovateľné tablety do dávkovacieho zariadenia',
    'Disperzia na koncentrát na infúznu disperziu',
    'Disperzia pre rozprašovač',
]

def find_all(abrevs, terms):
    result = []
    for abrev in abrevs:
        abrev_split = abrev.split()
        for term in terms:
            # Restart the count for every (abbreviation, term) pair
            abrev_match_count = 0
            for word in term.split():
                # Only advance when the word starts with the next unmatched piece
                if abrev_match_count < len(abrev_split) and word.lower().startswith(abrev_split[abrev_match_count]):
                    abrev_match_count += 1
            # Every piece was found, in order, at the start of some word
            if abrev_match_count == len(abrev_split):
                result.append((abrev, term))
    return result

print(find_all(abrevs, terms))
# [('dis inf', 'Disperzia na koncentrát na infúznu disperziu')]
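Since the question notes that most abbreviations have no matching term, it may help to keep the unmatched ones visible instead of dropping them. The following is a minimal sketch of the same in-order word-prefix idea, written more compactly; the helper names `word_prefix_match` and `match_table` are made up for this illustration and are not part of any library.

```python
abrevs = ['dis ijp', 'dis inf', 'dis inj']
terms = [
    'Dispergovateľná tableta',
    'Dispergovateľné tablety do dávkovacieho zariadenia',
    'Disperzia na koncentrát na infúznu disperziu',
    'Disperzia pre rozprašovač',
]

def word_prefix_match(abrev, term):
    """True if each piece of `abrev` is a prefix of some word of `term`, in order.

    Hypothetical helper for illustration only.
    """
    pieces = abrev.lower().split()
    # One shared iterator: `any` consumes words up to and including the match,
    # so the next piece can only match a later word.
    words = iter(term.lower().split())
    # An empty abbreviation is treated as "no match" (an assumption of this sketch).
    return bool(pieces) and all(
        any(word.startswith(piece) for word in words) for piece in pieces
    )

def match_table(abrevs, terms):
    """Map every abbreviation to its matching terms (empty list when none match)."""
    return {a: [t for t in terms if word_prefix_match(a, t)] for a in abrevs}

table = match_table(abrevs, terms)
for abrev, matched in table.items():
    print(abrev, '->', matched)
```

With the DataFrames from the question, the two input lists could be taken from `df1['abrev'].tolist()` and `df2['term'].tolist()`. Abbreviations that map to an empty list are the ones with no candidate term in df2.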