I have two dataframes df1 and df2
df1 =
University | School | Student first name | last name | nick name |
---|---|---|---|---|
AAA | Law | John | Mckenzie | Stevie |
BBB | Business | Steve | Savannah | JO |
CCC | Engineering | Mark | Justice | Fre |
DDD | Arts | Stuart | Little | Rah |
EEE | Life science | Adam | Johnson | meh |
120 rows × 5 columns
df2 =
Statement |
---|
Stuart had a headache last nigh which was due to th…… |
Rah basically found a new found friend which lead to the…… |
Gerome got a brand new watch which was………. |
Adam was found chilling all through out his life…… |
Savannah is such a common name that…….. |
3000 rows × 1 column
The aim is to build df3.
For each statement, search for any value from the columns "Student first name", "Student last name", and "Student nick name" of df1, and attach the University and School of the matching row, to produce the table below:
df3 =
Statement | Matching | University | School |
---|---|---|---|
Stuart had a headache last nigh which was due to th… | Stuart | DDD | Arts |
Rah basically found a new found friend which lead to | Rah | DDD | Arts |
Gerome got a brand new watch which was………. | NA | NA | NA |
Adam was found chilling all through out his life…… | Adam | EEE | Life science |
Savannah is such a common name that…….. | Savannah | BBB | Business |
3000 rows × 4 columns
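For anyone who wants to reproduce this, here is a minimal sketch of the two frames built only from the five rows shown above (the real frames have 120 and 3000 rows). The column labels "Student first name", "last name" and "nick name" are an assumption read off the df1 header row, and the truncated statements are kept exactly as posted.

```python
import pandas as pd

# Sample of df1: only the five rows shown in the question.
# Column labels are assumed from the header row of the posted table.
df1 = pd.DataFrame(
    {
        "University": ["AAA", "BBB", "CCC", "DDD", "EEE"],
        "School": ["Law", "Business", "Engineering", "Arts", "Life science"],
        "Student first name": ["John", "Steve", "Mark", "Stuart", "Adam"],
        "last name": ["Mckenzie", "Savannah", "Justice", "Little", "Johnson"],
        "nick name": ["Stevie", "JO", "Fre", "Rah", "meh"],
    }
)

# Sample of df2: the five statements shown, truncation marks left untouched.
df2 = pd.DataFrame(
    {
        "Statement": [
            "Stuart had a headache last nigh which was due to th……",
            "Rah basically found a new found friend which lead to the……",
            "Gerome got a brand new watch which was……….",
            "Adam was found chilling all through out his life……",
            "Savannah is such a common name that……..",
        ]
    }
)

print(df1.shape, df2.shape)
```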
CodePudding user response:
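Naïve approach: loop over the columns to find matches, then loop again to merge on those matches:
import re
import pandas as pd

columns_to_match = ["Student first name", "last name", "nick name"]

# Pass 1: for each name column, extract the first value of that column
# found in each statement and keep only the statements that matched.
dfs = []
for column in columns_to_match:
    search_strings = df1[column].unique().tolist()
    regex = "|".join(map(re.escape, search_strings))
    matching = df2["Statement"].str.extract(f"({regex})", expand=False)
    dfs.append(df2.assign(Matching=matching).dropna(subset=["Matching"]))
matched_df = pd.concat(dfs).reset_index(drop=True)

# Pass 2: merge the matches back onto df1, one name column at a time.
# The inner merge keeps only statements that matched a name.
dfs = []
for column in columns_to_match:
    final_df = df1.merge(matched_df, how="inner", left_on=column, right_on="Matching")
    dfs.append(final_df)
final_df = pd.concat(dfs).reset_index(drop=True).drop(columns=columns_to_match)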
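Because the merges above are inner joins, a statement with no match (the Gerome row in the desired df3) does not appear in final_df. A possible single-pass variant, sketched below on hypothetical miniature frames, reshapes the three name columns into one lookup table, extracts with a single regex, and uses a left merge so unmatched statements stay with missing values. The names `name_cols`, `lookup` and `pattern` are illustrative, and the `\b` word boundaries are an added assumption so that a short name does not match inside a longer word.

```python
import re
import pandas as pd

# Hypothetical miniature versions of df1 and df2, for illustration only
df1 = pd.DataFrame(
    {
        "University": ["BBB", "DDD"],
        "School": ["Business", "Arts"],
        "Student first name": ["Steve", "Stuart"],
        "last name": ["Savannah", "Little"],
        "nick name": ["JO", "Rah"],
    }
)
df2 = pd.DataFrame(
    {
        "Statement": [
            "Stuart had a headache",
            "Gerome got a brand new watch",
            "Savannah is such a common name",
        ]
    }
)

name_cols = ["Student first name", "last name", "nick name"]

# Long-format lookup: one row per name with its University and School
lookup = (
    df1.melt(id_vars=["University", "School"], value_vars=name_cols, value_name="Matching")
    .drop(columns="variable")
    .dropna(subset=["Matching"])
    .drop_duplicates(subset="Matching")
)

# One alternation over every name, longest first, bounded by word boundaries
names = sorted(lookup["Matching"], key=len, reverse=True)
pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\b"

# Extract the first name found in each statement, then left-merge the lookup
df3 = df2.copy()
df3["Matching"] = df3["Statement"].str.extract(pattern, expand=False)
df3 = df3.merge(lookup, on="Matching", how="left")

print(df3)
```

If two students share a name, `drop_duplicates` keeps only the first one in the lookup, so this sketch resolves such ties arbitrarily.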
CodePudding user response:
My answer makes the following assumptions:
- The index on df1 serves as the student ID and is unique.
- You only want to fill in the first student found. A statement like "John and Steve are friends" will be assigned to John.
import re
import pandas as pd

# Track which statements already have a student; share df2's index so masks align
assigned = pd.Series(False, index=df2.index)
df3 = df2.copy()

# Loop through each student, taking their first, last and nick name
for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    # If all statements have been assigned, terminate the loop
    if assigned.all():
        break

    # Combine the student's first, last and nick name into a regex pattern
    pattern = f"({'|'.join(names.map(re.escape))})"

    # For each UNASSIGNED statement, find the pattern. We only search unassigned
    # statements to lower the number of searches.
    match = df3.loc[~assigned, "Statement"].str.extract(pattern, expand=False)

    # Mark the statement as assigned. match only covers unassigned rows, so
    # expand it to the full index and treat the missing rows as "no match".
    cond = match.notna().reindex(assigned.index, fill_value=False)
    assigned[cond] = True

    # Fill in the student's info
    df3.loc[cond, "Match"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]
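Since only the first student found is filled in, statements that mention several students are resolved silently. As a rough check, one could list every name each statement contains before assigning anything. The sketch below uses hypothetical sample frames and `Series.str.findall`; `all_names`, `hits` and `ambiguous` are illustrative names, and the `\b` word boundaries are an added assumption.

```python
import re
import pandas as pd

# Hypothetical sample: two students and two statements
df1 = pd.DataFrame(
    {
        "Student first name": ["John", "Steve"],
        "last name": ["Mckenzie", "Savannah"],
        "nick name": ["Stevie", "JO"],
    }
)
df2 = pd.DataFrame(
    {"Statement": ["John and Steve are friends", "Savannah is such a common name"]}
)

name_cols = ["Student first name", "last name", "nick name"]

# Every distinct name across the three columns
all_names = df1[name_cols].melt()["value"].dropna().unique()

# Non-capturing group so findall returns the full matched names
pattern = r"\b(?:" + "|".join(map(re.escape, all_names)) + r")\b"

# List of all names found in each statement, in order of appearance
hits = df2["Statement"].str.findall(pattern)

# Statements that mention more than one name need a tie-breaking rule
ambiguous = df2[hits.str.len() > 1]

print(hits.tolist())
print(ambiguous)
```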
CodePudding user response:
Rather than iterating through each cell, you could create three dataframes (merging with all three columns separately) and concatenate the results into one dataframe.
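import pandas as pd

# Use the first word of each statement as the candidate name
df2['Matching'] = df2['Statement'].str.split().str[0]
dfs = []
for col in ['Student first name', 'last name', 'nick name']:
    # Inner merge so each temporary frame holds only real matches for this column
    lookup = df1[[col, 'University', 'School']].rename(columns={col: 'Matching'})
    dfs.append(pd.merge(df2, lookup, how='inner'))
# Combine the matches, then left-merge onto df2 so unmatched statements
# appear once with NaN instead of once per failed merge
matches = pd.concat(dfs).drop_duplicates()
df3 = df2.merge(matches, how='left')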
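The snippet above only looks at the first word of each statement. If a name can appear anywhere in the statement, one merge-based option, sketched here on hypothetical sample frames, is to split every statement into words, explode them into rows, and merge those words against a single lookup table. The names `lookup`, `tokens` and `hits` are illustrative. Splitting on whitespace leaves punctuation attached to words, so "Stuart," would not match in this sketch.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2, for illustration only
df1 = pd.DataFrame(
    {
        "University": ["BBB", "DDD"],
        "School": ["Business", "Arts"],
        "Student first name": ["Steve", "Stuart"],
        "last name": ["Savannah", "Little"],
        "nick name": ["JO", "Rah"],
    }
)
df2 = pd.DataFrame(
    {
        "Statement": [
            "Yesterday Stuart had a headache",
            "Gerome got a brand new watch",
            "People say Savannah is a common name",
        ]
    }
)

name_cols = ["Student first name", "last name", "nick name"]

# Single lookup table: one row per name with its University and School
lookup = (
    df1.melt(id_vars=["University", "School"], value_vars=name_cols, value_name="Matching")
    .drop(columns="variable")
    .dropna(subset=["Matching"])
    .drop_duplicates(subset="Matching")
)

# One row per (statement, word); "row" remembers the statement's index in df2
tokens = (
    df2["Statement"]
    .str.split()
    .explode()
    .rename("Matching")
    .rename_axis("row")
    .reset_index()
)

# Keep only words that are names; the first hit per statement wins
hits = (
    tokens.merge(lookup, on="Matching", how="inner")
    .sort_values("row", kind="stable")
    .drop_duplicates(subset="row")
    .set_index("row")
)

# Left join on the index so statements without a hit keep missing values
df3 = df2.join(hits)

print(df3)
```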