I have 2 dataframes.
df_dora
content | feature | id | |
---|---|---|---|
1 | cyber hygien | risk management | 1 |
2 | cyber risk | risk management | 2 |
... | ... | ... ... | |
59 | intellig share | information sharing | 63 |
60 | inform share | information sharing | 64 |
df_corpus
content | id | meta.name | meta._split_id | |
---|---|---|---|---|
0 | market grow cyber attack... | 56a2a2e28954537131a4aa734f49e361 | 14_Group_AG_2021 | 0 |
1 | sec form file index | 7aedfd4df02687d3dff9897c925da508 | 14_Group_AG_2021 | 1 |
... | ... | ... | ... | |
213769 | cyber secur alert parent compani fina... | ab10325601597f203f3f0af7aa647112 | 17_La_Banque_2021 | 8581 |
213770 | intellig share statement parent compani fina... | 6af5687ac31849d19d2048e0b2ca472d | 17_La_Banque_2021 | 8582 |
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name. I tried to use isin
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
Returns only 17 rows
content | id | meta.name | meta | |
---|---|---|---|---|
41474 | incid | a4c478e0fad1b9775c05e01d871b3aaf | 3_Agricole_2021 | 10185 |
68690 | oper risk | 2e5139d82c242c89523110cc1110647a | 10_Banking_Group_PLC_2021 | 5525 |
... | ... | ... | ... | ... |
99259 | risk report | a84eefb9a4772d13eb67f2d6ae5215cb | 31_Building_Society_2021 | 4820 |
105662 | risk manag | e8050be841fedb6dd10599e8b4892a9f | 43_Bank_SA_2021 | 131 |
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns 17 rows
if I search for 2 of the terms that exist in df_dora directly in df_corpus
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file.)
So given the above...my question is this how do I extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name.
Thanks in advance for any help.
CodePudding user response:
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64