Home > OS >  Search values in a Pandas DataFrame with values from another DataFrame
Search values in a Pandas DataFrame with values from another DataFrame

Time:05-16

I have 2 dataframes.

df_dora

content feature id
1 cyber hygien risk management 1
2 cyber risk risk management 2
... ... ... ...
59 intellig share information sharing 63
60 inform share information sharing 64

df_corpus

content id meta.name meta._split_id
0 market grow cyber attack... 56a2a2e28954537131a4aa734f49e361 14_Group_AG_2021 0
1 sec form file index 7aedfd4df02687d3dff9897c925da508 14_Group_AG_2021 1
... ... ... ...
213769 cyber secur alert parent compani fina... ab10325601597f203f3f0af7aa647112 17_La_Banque_2021 8581
213770 intellig share statement parent compani fina... 6af5687ac31849d19d2048e0b2ca472d 17_La_Banque_2021 8582

I am trying to extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name. I tried to use isin

df = df_corpus[df_corpus.content.isin(df_dora.content)]

len(df)

Returns only 17 rows

content id meta.name meta
41474 incid a4c478e0fad1b9775c05e01d871b3aaf 3_Agricole_2021 10185
68690 oper risk 2e5139d82c242c89523110cc1110647a 10_Banking_Group_PLC_2021 5525
... ... ... ... ...
99259 risk report a84eefb9a4772d13eb67f2d6ae5215cb 31_Building_Society_2021 4820
105662 risk manag e8050be841fedb6dd10599e8b4892a9f 43_Bank_SA_2021 131

df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()

also returns 17 rows

if I search for 2 of the terms that exist in df_dora directly in df_corpus resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]

print(resiliency_term)

I get 243 rows (which matches what was in the original file.)

So given the above...my question is this how do I extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name.

Thanks in advance for any help.

CodePudding user response:

unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())

Output given your four lines of each:

17_La_Banque_2021  intellig share    1
Name: content, dtype: int64
  • Related