Home > OS >  Iterate row by row through a column of lists and turning matches over X in a new pandas dataframe
Iterate row by row through a column of lists and turning matches over X in a new pandas dataframe

Time:06-01

I'm trying to iterate through list found in a pandas dataframe column, and returning results that matched more than three times across the lists contained in the other rows in a new dataframe.

Here's how the data looks:

enter image description here

Desired Output:

enter image description here

(The output is because those specific keywords were found in the lists in at least three other rows).

Minimum Reproducible Example:

import pandas as pd

# initialize data of lists.
data = {'url': ["www.bbc.co.uk", "www.cabinzero.com", "www.cntraveller.com", "www.forbes.com", "www.gov.scot", "www.gov.uk", "www.ons.gov.uk"],
        'keyword': ["['amber travel list', 'travel amber list', 'amber list countries uk travel', 'travel amber list countries', 'amber list countries travel']", "['amber list countries uk travel', 'travel amber list countries', 'amber travel list', 'travel amber list', 'amber list countries travel']", "['travel amber list', 'amber list countries uk travel', 'amber travel list', 'amber list countries travel', 'travel amber list countries']", "['amber travel list', 'travel amber list countries', 'travel amber list', 'amber list countries travel', 'amber list countries uk travel']", "['amber list countries travel', 'travel amber list countries', 'amber list countries uk travel', 'travel amber list', 'amber travel list']", "['amber list countries travel', 'amber list countries uk travel', 'amber travel list']", "['amber list countries uk travel', 'amber travel list', 'travel amber list countries', 'amber list countries travel']"]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)

What I've tried I've tried dumping the list column to a single list and iterating through to count the occurrences, but couldn't make it work and not sure if that's the best approach.

CodePudding user response:

If inside a same list each keyword is unique then you could:

from itertools import chain

listed_keywords = df.keyword.apply(lambda x: eval(x)).values # returns array of list
all_keywords = list(chain.from_iterable(listed_keywords)) # Concat all the lists into 1 global list of keywords

unique_keyword, nunique_keyword = np.unique(all_keywords, return_counts = True)# Return unique keywords and their respective frequency among all the keywords

df_keywords = pd.DataFrame(dict(keyword = unique_keyword, frequency = nunique_keyword)) # Create a DataFrame so you can easily filter according to keyword frequency.

Hope this answered your question !

  • Related