I'm attempting to use two dataframes, one as a lookup table to find a substring match on the value in my datasets dataframes column. After I find the value, I'd like to create a new column with that value and iterate through the entire column and remove the matched substring from the initial column and loop through until there are no more matches.
The first problem I'm having is I'm not able to match or return the value of the match unless it's an exact string. The tricky part is sometimes the Ingredient_Name contains multiple words for a single ingredient.
This is a smaller sample of my code, the commented sections include the error or the problem with what I tried:
import pandas as pd
singleingredientdata = {
'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[1,1,2,1,1,1,1,2,1],
'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}
multiingredientdata = {
'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[4,3,5,1],
'Num_Of_Ingredients':[3,2,4,1]
}
df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name']) ##Doesn't flag True unless it finds a single igredient exists in the string
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]] ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any()) TypeError: first argument must be string or compiled pattern
iterator = 1
for j in range(0,max_value):
col_name = 'Ingredient_Name' str(iterator)
# contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
# df2[col_name]= df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
iterator = 1
print(df2)
Ideally my results would look like this:
Ingredient_Name Ingredient_Name1 Igredient_Name2 Ingredient_Name3 Ingredient_Name4
ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE
ACEBUTOLOL ACETYLSALICYLIC ACID
COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE
BROMIDE
The original Ingredient_Name would contain any values that were not found in the lookup, in this example there are none.
What I've attempted to get the match on the ingredients so far is the following I've included the error messages and the issue with that line of code:
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name']) ##Doesn't flag True unless it finds a single igredient exists in the string
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]] ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any()) TypeError: first argument must be string or compiled pattern
The part I'm able to match on the exact string returns the following results, but I'd like to return the value instead of true/fase and match on the substring not the exact match:
Ingredient_Name WordCount Num_Of_Ingredients Exists
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE 4 3 False
1 ACEBUTOLOL ACETYLSALICYLIC ACID 3 2 False
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR... 5 4 False
3 BROMIDE 1 1 True
Perhaps I'm going about this problem the wrong way, or maybe I'm close, but I'm not grasping something. Any help that you can offer to lead me in the right direction, I appreciate it!
CodePudding user response:
In order to maintain a column of unmatched ingredients, the best I could come up with was this. If unmatched ingredients aren't as important you're better off using the other inbuilt functions for string and pattern matching mentioned in the other answers. This is probably not the most efficient way to do this.
def match_ingredients(row, df):
base_str = row['Ingredient_Name']
result_count = 1
result = {}
for idx, ingredient in df.iterrows():
if ingredient['Ingredient_Name'] in base_str:
result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
result_count = 1
base_str = base_str.replace(ingredient['Ingredient_Name'], "")
result['Ingredient_Name'] = base_str
return result
result = df2.apply(match_ingredients,axis=1, result_type='expand', args=(df1,))
df2.apply(match_ingredients)
Executes the function over each row of df2
and combines the row type response of the function into another dataframe.
It takes df1
as a a parameter so that we can iterate over every ingredient (This can be modified to a list of ingredients as well) and in
can be used as a substring check in native Python. If the string is inside the total ingredient list then we use replace
to "subtract" it from the total list of ingredients.
The other thing here is that the dictionary returned will have its keys treated as column names so we can assign the remaining base string (after replacing all matching strings) to the constant column name Ingredient Name
.
result_type = 'expand'
implies that the response of the function is to be turned into multiple columns if possible.
Docs for apply.
CodePudding user response:
I don't fully understand what you really want but maybe this could help you?
pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' (out.columns 1).astype(str)
out = df2.join(out)
print(out)
# Output:
Ingredient_Name WordCount Num_Of_Ingredients \
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE 4 3
1 ACEBUTOLOL ACETYLSALICYLIC ACID 3 2
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE 5 4
3 BROMIDE 1 1
Ingredient_Name_1 Ingredient_Name_2 Ingredient_Name_3 Ingredient_Name_4
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE NaN
1 ACEBUTOLOL ACETYLSALICYLIC ACID NaN NaN
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE
3 BROMIDE NaN NaN NaN
CodePudding user response:
- Use
str.extractall
to get all matches unstack
to convert to individual columns
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()
#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")
>>> output
Ingredient_Name_0 Ingredient_Name_1 Ingredient_Name_2 Ingredient_Name_3
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE NaN
1 ACEBUTOLOL ACETYLSALICYLIC ACID NaN NaN
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE
3 BROMIDE NaN NaN NaN