I'm trying to iterate through list found in a pandas dataframe column, and returning results that matched more than three times across the lists contained in the other rows in a new dataframe.
Here's how the data looks:
Desired Output:
(The output is because those specific keywords were found in the lists in at least three other rows).
Minimum Reproducible Example:
import pandas as pd
# initialize data of lists.
data = {'url': ["www.bbc.co.uk", "www.cabinzero.com", "www.cntraveller.com", "www.forbes.com", "www.gov.scot", "www.gov.uk", "www.ons.gov.uk"],
'keyword': ["['amber travel list', 'travel amber list', 'amber list countries uk travel', 'travel amber list countries', 'amber list countries travel']", "['amber list countries uk travel', 'travel amber list countries', 'amber travel list', 'travel amber list', 'amber list countries travel']", "['travel amber list', 'amber list countries uk travel', 'amber travel list', 'amber list countries travel', 'travel amber list countries']", "['amber travel list', 'travel amber list countries', 'travel amber list', 'amber list countries travel', 'amber list countries uk travel']", "['amber list countries travel', 'travel amber list countries', 'amber list countries uk travel', 'travel amber list', 'amber travel list']", "['amber list countries travel', 'amber list countries uk travel', 'amber travel list']", "['amber list countries uk travel', 'amber travel list', 'travel amber list countries', 'amber list countries travel']"]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
What I've tried I've tried dumping the list column to a single list and iterating through to count the occurrences, but couldn't make it work and not sure if that's the best approach.
CodePudding user response:
If inside a same list each keyword is unique then you could:
from itertools import chain
listed_keywords = df.keyword.apply(lambda x: eval(x)).values # returns array of list
all_keywords = list(chain.from_iterable(listed_keywords)) # Concat all the lists into 1 global list of keywords
unique_keyword, nunique_keyword = np.unique(all_keywords, return_counts = True)# Return unique keywords and their respective frequency among all the keywords
df_keywords = pd.DataFrame(dict(keyword = unique_keyword, frequency = nunique_keyword)) # Create a DataFrame so you can easily filter according to keyword frequency.
Hope this answered your question !