String literal matching between words in two different dataframe (dfs) and generate a new dataframe


I have two dataframes, df1 and df2.

df1 =

University  School        Student first name  last name  nick name
AAA         Law           John                Mckenzie   Stevie
BBB         Business      Steve               Savannah   JO
CCC         Engineering   Mark                Justice    Fre
DDD         Arts          Stuart              Little     Rah
EEE         Life science  Adam                Johnson    meh

120 rows X 5 columns

df2 =

Stuart had a headache last nigh which was due to th……
Rah basically found a new found friend which lead to the……
Gerome got a brand new watch which was……….
Adam was found chilling all through out his life……
Savannah is such a common name that……..

3000 rows X 1 column

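The aim is to build df3.

For each statement in df2, look for a literal string match against every cell in the df1 columns "Student first name", "last name" and "nick name", and attach the matching student's University and School to produce the table below. Statements with no match get NA.

df3 =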

Statement                                                  Matching  University  School
Stuart had a headache last nigh which was due to th…       Stuart    DDD         Arts
Rah basically found a new found friend which lead to       Rah       DDD         Arts
Gerome got a brand new watch which was……….                 NA        NA          NA
Adam was found chilling all through out his life……         Adam      EEE         Life science
Savannah is such a common name that……..                    Savannah  BBB         Business

3000 rows X 4 columns

CodePudding user response:

Naïve approach: loop over the columns to find matches, then loop again to merge on those matches:

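import re

import pandas as pd

columns_to_match = ["Student first name", "last name", "nick name"]

# Pass 1: for each name column, extract the first matching name from every statement
dfs = []
for column in columns_to_match:
    search_strings = df1[column].unique().tolist()

    regex = "|".join(map(re.escape, search_strings))
    df2["Matching"] = df2["Statement"].str.extract(f"({regex})")
    # Keep a copy of the matched rows; df2["Matching"] is overwritten on the next pass
    dfs.append(df2.dropna(subset=["Matching"]).copy())

matched_df = pd.concat(dfs).reset_index(drop=True)

# Pass 2: join each matched statement back to its row in df1.
# The inner merge means statements with no match are absent from final_df.
dfs = []
for column in columns_to_match:
    final_df = df1.merge(matched_df, how="inner", left_on=column, right_on="Matching")
    dfs.append(final_df)

final_df = pd.concat(dfs).reset_index(drop=True).drop(columns=columns_to_match)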
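
The inner merge above leaves out statements that match no name, whereas the table in the question keeps them with NA. One possible way to keep them, as a rough sketch only: build a single long-format lookup of names and left-merge onto it. The name `lookup`, the `\b` word boundaries, and the longest-name-first ordering are my own assumptions, not part of the answer above, and the statements are shortened stand-ins for the ones in the question.

```python
import re

import pandas as pd

# Sample data modelled on the question (statements shortened).
df1 = pd.DataFrame({
    "University": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "School": ["Law", "Business", "Engineering", "Arts", "Life science"],
    "Student first name": ["John", "Steve", "Mark", "Stuart", "Adam"],
    "last name": ["Mckenzie", "Savannah", "Justice", "Little", "Johnson"],
    "nick name": ["Stevie", "JO", "Fre", "Rah", "meh"],
})
df2 = pd.DataFrame({"Statement": [
    "Stuart had a headache",
    "Rah basically found a new friend",
    "Gerome got a brand new watch",
    "Adam was found chilling",
    "Savannah is such a common name",
]})

name_cols = ["Student first name", "last name", "nick name"]

# Long-format lookup: one row per name with its University and School.
# If two students share a name, only the first one is kept (assumption).
lookup = (
    df1.melt(id_vars=["University", "School"], value_vars=name_cols,
             value_name="Matching")
       .drop(columns="variable")
       .dropna(subset=["Matching"])
       .drop_duplicates(subset="Matching")
)

# One alternation over all names, longest first so "Stevie" is tried before
# "Steve"; \b avoids matching a short name inside a longer word.
names = sorted(lookup["Matching"], key=len, reverse=True)
pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\b"

df3 = df2.copy()
df3["Matching"] = df3["Statement"].str.extract(pattern, expand=False)

# Left merge keeps unmatched statements; their University/School are NaN.
df3 = df3.merge(lookup, how="left", on="Matching")
print(df3)
```

Because `lookup` has one row per distinct name, the left merge cannot multiply rows, so df3 keeps one row per statement.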

CodePudding user response:

My answer makes the following assumptions:

  • The index on df1 serves as the student ID and is unique.
  • You only want to fill in the first student found, going through df1 in row order. A statement like "John and Steve are friends" will be assigned to John.

import re

import pandas as pd

assigned = pd.Series(False, index=df2.index)
df3 = df2.copy()

# Loop through each student, taking their first, last and nick name
for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    # If all statements have been assigned, terminate the loop
    if assigned.all():
        break

    # Combine the student's first, last and nick name into a regex pattern
    pattern = f"({'|'.join(names.map(re.escape))})"

    # For each UNASSIGNED statement, Find the pattern. We only search unassigned
    # statements to lower the number of searches.
    match = df3.loc[~assigned, "Statement"].str.extract(pattern, expand=False)

    # Mark the statement as assigned
    cond = ~assigned & match.notna()
    assigned[cond] = True

    # Fill in the student's info
    df3.loc[cond, "Match"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]
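
To make the second assumption concrete, here is a small check of which student wins when a statement mentions two. It repeats the loop above on a two-row df1 and a single made-up statement (both are illustrative data, not from the question). The statement deliberately names Steve first, to show that the assignment follows the row order of df1 rather than the position of the name in the text.

```python
import re

import pandas as pd

# Two students from the question's df1; John is in the first row.
df1 = pd.DataFrame({
    "University": ["AAA", "BBB"],
    "School": ["Law", "Business"],
    "Student first name": ["John", "Steve"],
    "last name": ["Mckenzie", "Savannah"],
    "nick name": ["Stevie", "JO"],
})

# Hypothetical statement that mentions Steve before John.
df3 = pd.DataFrame({"Statement": ["Steve and John are friends"]})
assigned = pd.Series(False, index=df3.index)

for idx, names in df1[["Student first name", "last name", "nick name"]].iterrows():
    if assigned.all():
        break  # every statement already has a student

    pattern = f"({'|'.join(names.map(re.escape))})"
    match = df3.loc[~assigned, "Statement"].str.extract(pattern, expand=False)

    cond = ~assigned & match.notna()
    assigned[cond] = True

    df3.loc[cond, "Match"] = match[cond]
    df3.loc[cond, "University"] = df1.loc[idx, "University"]
    df3.loc[cond, "School"] = df1.loc[idx, "School"]

print(df3)
```

John's row is processed first and his pattern matches, so the statement is assigned to John (AAA, Law) and the loop stops before Steve's row is tried.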

CodePudding user response:

Rather than iterating through each cell, you could create three dataframes (merging with all three columns separately) and concatenate the results into one dataframe.

import pandas as pd

# Use the first word of each statement as the candidate name
df2['Matching'] = df2['Statement'].str.split().str[0]

dfs = []
for col in ['Student first name', 'last name', 'nick name']:
    df_temp = pd.merge(df2, df1[[col, 'University', 'School']].rename(columns={col:'Matching'}), how='left')
    dfs.append(df_temp)

df3 = pd.concat(dfs).drop_duplicates()
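
One thing to watch with the three left merges: a statement that matches in one name column still produces all-NaN rows from the other two, and `drop_duplicates()` does not remove those, because they differ from the matched row. A possible cleanup, sketched here under the assumption that statements are unique, is to collapse the stacked result to one row per statement with `groupby(...).first()`, which takes the first non-null value in each column. The names `stacked` and `collapsed` are illustrative, and the statements are shortened stand-ins for the ones in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "University": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "School": ["Law", "Business", "Engineering", "Arts", "Life science"],
    "Student first name": ["John", "Steve", "Mark", "Stuart", "Adam"],
    "last name": ["Mckenzie", "Savannah", "Justice", "Little", "Johnson"],
    "nick name": ["Stevie", "JO", "Fre", "Rah", "meh"],
})
df2 = pd.DataFrame({"Statement": [
    "Stuart had a headache",
    "Rah basically found a new friend",
    "Gerome got a brand new watch",
    "Adam was found chilling",
    "Savannah is such a common name",
]})

# First word of each statement is the candidate name, as in the answer above.
df2["Matching"] = df2["Statement"].str.split().str[0]

dfs = []
for col in ["Student first name", "last name", "nick name"]:
    right = df1[[col, "University", "School"]].rename(columns={col: "Matching"})
    dfs.append(pd.merge(df2, right, how="left"))

stacked = pd.concat(dfs)

# One row per statement: first() skips NaN, so a match found in any of the
# three merges wins over the all-NaN rows from the other two.
collapsed = stacked.groupby("Statement", sort=False, as_index=False).first()
print(collapsed)
```

Note that with this approach "Matching" is always the first word of the statement, so an unmatched statement still shows its first word there (for example "Gerome") while University and School stay empty.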