Given the following table sample:
| identifier | matched_string |
|---|---|
| occupation | manager |
| occupation | manager |
| skill | manager |
| department | marketing manager |
| skill | marketing |
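For reference, the sample above can be built as a pandas DataFrame; this is just a convenience sketch (the name df5 matches the code below):
import pandas as pd

df5 = pd.DataFrame({
    'identifier': ['occupation', 'occupation', 'skill', 'department', 'skill'],
    'matched_string': ['manager', 'manager', 'manager', 'marketing manager', 'marketing'],
})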
I would like to find the cases where a matched_string is contained in another matched_string and has a different identifier.
In the end, I would like to get a DataFrame that lists these matches.
I have this code so far, but I feel it is super inefficient:
import pandas as pd

# df5 is the DataFrame I want to search through
df4 = pd.DataFrame()

# Find overlaps: rows whose matched_string is contained in another
# matched_string that has a different identifier
for index, row in df5.iterrows():
    for index2, row2 in df5.iterrows():
        if row["matched_string"] in row2["matched_string"] and row["identifier"] != row2["identifier"]:
            df4 = df4.append(row)
            df4 = df4.append(row2)
Would someone have an idea on how to make this more efficient?
CodePudding user response:
Wouldn't an aggregation as set per matched_string work for you?
df.groupby('matched_string')['identifier'].agg(set)
Output:
matched_string
manager {occupation, skill}
marketing {skill}
marketing manager {department}
Name: identifier, dtype: object
If you want to use it to filter your DataFrame and keep only the rows with multiple identifiers per matched_string:
s = df.groupby('matched_string')['identifier'].agg(set)
idx = s[s.str.len().gt(1)].index
out = df.loc[df['matched_string'].isin(idx)]
Output:
identifier matched_string
0 occupation manager
1 occupation manager
2 skill manager
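Note that the grouping above only pairs identical matched_string values. If you also need the substring-containment case from the question (e.g. manager inside marketing manager), one possibility is a cross join of the distinct pairs. This is only a sketch, still quadratic in the number of distinct strings, and assumes df5 is the sample DataFrame from the question:
# Distinct (identifier, matched_string) pairs, cross-joined against themselves
pairs = df5.drop_duplicates().merge(df5.drop_duplicates(), how='cross', suffixes=('_a', '_b'))

# Keep pairs where one string is contained in the other and the identifiers differ
mask = (
    pairs.apply(lambda r: r['matched_string_a'] in r['matched_string_b'], axis=1)
    & (pairs['identifier_a'] != pairs['identifier_b'])
)
out = pairs[mask]
Here out keeps both rows of each containing pair side by side in one row; whether you leave it wide like this or stack it back into long form depends on the shape you need.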