How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?-CodePudding

I have a bunch of keywords stored in a 620x2 pandas dataframe seen below. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?

CodePudding user response：

Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.

df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(';\s*').apply(set)
df.apply(lambda x : x.apply(lambda y : all([kw in y for kw in keyword_list]))).sum().sum()

CodePudding user response：

Use .loc to find if the keywords appear together.

Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.

# create a filter for keyword 1
filter_keyword_2 = (df['column_name'].str.contains('critical infrastructure'))

# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))

# you can create more filters with the same construction as above.

# To check the number of times both the keywords appear

len(df.loc[filter_keyword_1 & filter_keyword_2])

# To see the dataframe

subset_df = df.loc[filter_keyword_1 & filter_keyword_2]

.loc selects the conditional data frame. You can use subset_df=df[df['column_name'].str.contains('string')] if you have only one condition.

To the column split or any other processing before you make the filters or run the filters again after processing.