String literal matching between words in two different dataframes (dfs) to generate a new dataframe

I have two dataframes, df1 and df2.

df1 =

University	School	Student first name	last name	nick name
AAA	Law	John	Mckenzie	Stevie
BBB	Business	Steve	Savannah	JO
CCC	Engineering	Mark	Justice	Fre
DDD	Arts	Stuart	Little	Rah
EEE	Life science	Adam	Johnson	meh

120 rows × 5 columns

df2 =

Statement
Stuart had a headache last nigh which was due to th……
Rah basically found a new found friend which lead to the……
Gerome got a brand new watch which was……….
Adam was found chilling all through out his life……
Savannah is such a common name that……..

3000 rows × 1 column

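The aim is to build df3.

For each statement in df2, search for any value from the df1 columns "Student first name", "last name" and "nick name". If a name is found, record it together with that student's University and School; otherwise fill the row with NA. The result should look like the table below.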

df3 =

Statement	Matching	University	School
Stuart had a headache last nigh which was due to th…	Stuart	DDD	Arts
Rah basically found a new found friend which lead to	Rah	DDD	Arts
Gerome got a brand new watch which was……….	NA	NA	NA
Adam was found chilling all through out his life……	Adam	EEE	Life science
Savannah is such a common name that……..	Savannah	BBB	Business

3000 rows × 4 columns
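
For anyone who wants to reproduce this, the sample rows can be built roughly as follows. This is only a sketch of the five rows shown above, with the truncated statements kept exactly as displayed; the real frames are larger.

```python
import pandas as pd

# Sample rows copied from the tables above; the real df1 has 120 rows
# and the real df2 has 3000 rows.
df1 = pd.DataFrame(
    {
        "University": ["AAA", "BBB", "CCC", "DDD", "EEE"],
        "School": ["Law", "Business", "Engineering", "Arts", "Life science"],
        "Student first name": ["John", "Steve", "Mark", "Stuart", "Adam"],
        "last name": ["Mckenzie", "Savannah", "Justice", "Little", "Johnson"],
        "nick name": ["Stevie", "JO", "Fre", "Rah", "meh"],
    }
)

df2 = pd.DataFrame(
    {
        "Statement": [
            "Stuart had a headache last nigh which was due to th……",
            "Rah basically found a new found friend which lead to the……",
            "Gerome got a brand new watch which was……….",
            "Adam was found chilling all through out his life……",
            "Savannah is such a common name that……..",
        ]
    }
)
```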

CodePudding user response：

A naïve approach: loop over the name columns to extract matches, then loop again to merge df1 on those matches. Statements with no match are dropped from the result:

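import re

import pandas as pd

columns_to_match = ["Student first name", "last name", "nick name"]

# Pass 1: for each name column, extract the first name found in each statement
dfs = []
for column in columns_to_match:
    search_strings = df1[column].dropna().unique().tolist()

    # Escape the names so regex metacharacters are matched literally
    regex = "|".join(map(re.escape, search_strings))
    matches = df2["Statement"].str.extract(f"({regex})", expand=False)
    # Work on a copy so df2 itself is left unchanged
    dfs.append(df2.assign(Matching=matches).dropna(subset=["Matching"]))

matched_df = pd.concat(dfs).reset_index(drop=True)

# Pass 2: attach University and School by merging on each name column in turn
dfs = []
for column in columns_to_match:
    final_df = df1.merge(matched_df, how="inner", left_on=column, right_on="Matching")
    dfs.append(final_df)

final_df = pd.concat(dfs).reset_index(drop=True).drop(columns=columns_to_match)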
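
One caveat with a plain alternation is that it also matches inside longer words, so the first name "John" would match a statement about "Johnson". Below is a minimal sketch of a stricter pattern, assuming whole-word matches are wanted. The helper name build_pattern and the sample values are illustrative and not part of the approach above.

```python
import re

import pandas as pd


def build_pattern(names, whole_word=True):
    """Return a one-group regex matching any of `names` (hypothetical helper)."""
    alternation = "|".join(map(re.escape, names))
    # \b anchors the match at word boundaries, so "John" no longer
    # matches inside "Johnson".
    return rf"\b({alternation})\b" if whole_word else f"({alternation})"


names = ["John", "Rah"]  # illustrative values
statements = pd.Series(
    ["Johnson got a watch", "Rah found a friend", "Rahul left"]
)

loose = statements.str.extract(build_pattern(names, whole_word=False), expand=False)
strict = statements.str.extract(build_pattern(names), expand=False)
```

Here loose finds a name in all three statements, while strict matches only the second one.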

CodePudding user response：

My answer makes the following assumptions:

The index of df1 is unique and serves as the student ID.
Only the first matching student is recorded for each statement. A statement like "John and Steve are friends" is assigned to John, because his row comes first in df1.

import re

import pandas as pd

# Track which statements already have a student, aligned to df2's index
assigned = pd.Series(False, index=df2.index)
# Pre-create the output columns so unmatched statements stay empty
df3 = df2.assign(Matching=None, University=None, School=None)

# Loop through each student, taking their first, last and nick name
for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    # If all statements have been assigned, terminate the loop
    if assigned.all():
        break

    # Combine the student's first, last and nick name into a regex pattern
    pattern = f"({'|'.join(names.map(re.escape))})"

    # Search only the UNASSIGNED statements to lower the number of searches,
    # then reindex so the result lines up with `assigned`
    match = (
        df3.loc[~assigned, "Statement"]
        .str.extract(pattern, expand=False)
        .reindex(df3.index)
    )

    # Mark the statement as assigned
    cond = ~assigned & match.notna()
    assigned[cond] = True

    # Fill in the student's info
    df3.loc[cond, "Matching"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]
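
To see the second assumption in action, the sketch below wraps the same loop in a helper and runs it on two students and two statements. The function name assign_first_match and the sample frames are hypothetical and only there for illustration.

```python
import re

import pandas as pd

NAME_COLS = ["Student first name", "last name", "nick name"]


def assign_first_match(df1, df2):
    """Attach the first student in df1 whose name appears in each statement."""
    df3 = df2.assign(Matching=None, University=None, School=None)
    assigned = pd.Series(False, index=df2.index)
    for idx, names in df1[NAME_COLS].iterrows():
        if assigned.all():
            break
        pattern = f"({'|'.join(names.map(re.escape))})"
        match = df3["Statement"].str.extract(pattern, expand=False)
        # Only statements that are still free can be claimed by this student
        cond = ~assigned & match.notna()
        assigned |= cond
        df3.loc[cond, "Matching"] = match[cond]
        df3.loc[cond, "University"] = df1.loc[idx, "University"]
        df3.loc[cond, "School"] = df1.loc[idx, "School"]
    return df3


df1 = pd.DataFrame(
    {
        "University": ["AAA", "BBB"],
        "School": ["Law", "Business"],
        "Student first name": ["John", "Steve"],
        "last name": ["Mckenzie", "Savannah"],
        "nick name": ["Stevie", "JO"],
    }
)
df2 = pd.DataFrame({"Statement": ["John and Steve are friends", "Steve is here"]})

df3 = assign_first_match(df1, df2)
print(df3[["Statement", "Matching", "University"]])
```

The first statement goes to John (AAA) even though it also mentions Steve, and the second goes to Steve (BBB).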

CodePudding user response：

Rather than iterating through each cell, you could merge df2 with each of the three name columns separately and combine the results into one dataframe. This assumes the name is always the first word of the statement.

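import pandas as pd

# Assumes the name is the first word of each statement
df2['Matching'] = df2['Statement'].str.split().str[0]

dfs = []
for col in ['Student first name', 'last name', 'nick name']:
    lookup = df1[[col, 'University', 'School']].rename(columns={col: 'Matching'})
    dfs.append(pd.merge(df2, lookup, how='left'))

# Each statement now appears once per name column; keep the first non-null
# University and School found for it, so unmatched copies do not add rows
df3 = pd.concat(dfs).groupby('Statement', sort=False, as_index=False).first()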
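
The three per-column merges can also be collapsed into a single lookup table. Below is a minimal sketch, assuming the name is the first word of the statement and that a name occurring more than once in df1 should resolve to its first occurrence in the stacked table. The sample frames are illustrative.

```python
import pandas as pd

name_cols = ["Student first name", "last name", "nick name"]

df1 = pd.DataFrame(
    {
        "University": ["DDD", "BBB"],
        "School": ["Arts", "Business"],
        "Student first name": ["Stuart", "Steve"],
        "last name": ["Little", "Savannah"],
        "nick name": ["Rah", "JO"],
    }
)
df2 = pd.DataFrame(
    {
        "Statement": [
            "Stuart had a headache",
            "Gerome got a brand new watch",
            "Savannah is such a common name",
        ]
    }
)

# Stack the three name columns into one "Matching" column
lookup = (
    df1.melt(id_vars=["University", "School"], value_vars=name_cols,
             value_name="Matching")
    .drop(columns="variable")
    .drop_duplicates(subset="Matching")
)

# One left merge keeps every statement exactly once
first_word = df2["Statement"].str.split().str[0]
df3 = df2.assign(Matching=first_word).merge(lookup, on="Matching", how="left")

# Blank out the first word where no student was found
df3["Matching"] = df3["Matching"].where(df3["University"].notna())
```

Because the merge is a single left join against a de-duplicated lookup, df3 keeps one row per statement, with NA in all three new columns when nothing matches.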