Home > other >  How to efficiently find a string that is contained in another string from a different group in panda
How to efficiently find a string that is contained in another string from a different group in panda

Time:12-07

given the following table sample:

identifier matched_string
occupation manager
occupation manager
skill manager
department marketing manager
skill marketing

I would like to find the cases where a matched_string is contained in another 'matched_string' and have a different identifier. In the end I would like to get a dataframe that lists these matches.

I have this code so far but I feel it is super inefficient:

#df5 is the dataframe I want to search through

df4 = pd.DataFrame()

#find overlaps from matched_strings -> where matched string is contained in another matched string and has a different identifier
for index, row in df5.iterrows():
    for index2, row2 in df5.iterrows():
        if row["matched_string"] in row2["matched_string"] and row["identifier"] != row2["identifier"]:
            df4 = df4.append(row1)
            df4 = df4.append(row2)      

Would someone have an idea on how to make this more efficient?

CodePudding user response:

Wouldn't an aggregation as set per matched_string work for you?

df.groupby('matched_string')['identifier'].agg(set)

Output:

matched_string
manager              {occupation, skill}
marketing                        {skill}
marketing manager           {department}
Name: identifier, dtype: object

If you want to use it to filter your DataFrame and keep only the rows with multiple identifiers per matched_string:

s = df.groupby('matched_string')['identifier'].agg(set)

idx = s[s.str.len().gt(1)].index

out = df.loc[df['matched_string'].isin(idx)]

Output:

   identifier matched_string
0  occupation        manager
1  occupation        manager
2       skill        manager
  • Related