Remove items from list stored in DataFrame

I have a DataFrame that contains text-cleaned ads in one column and a very basic description of the same ads in another column. I also have term frequencies stored in a dictionary in 'keyword': frequency format.

The task is to purge from the lists in the df all terms whose frequency falls below a certain cutoff.

import pandas as pd

adset = {"ID": ["(1483785165, 2009)", "(1538280431, 2010)", "(1795044103, 2010)"],
        "Body":[['price', '#', 'bedrooms', '#', 'bathrooms', '#', 'garage'],['cindy', 'lavender', 'mid', 'state', 'realty'],['upgrades', 'galore', 'perfectly', 'maintained', 'home', 'formals']]}

df = pd.DataFrame(adset)

keyword_dict={}
for row in df['Body']:
    for word in row:
        if word in keyword_dict:
            keyword_dict[word] += 1
        else:
            keyword_dict[word]=1

And here is where I got stuck:

def remove_sparse_words_from_df(df, term_freq, cutoff=1):
    for row in df['Body']:
        for word in row:
            if term_freq[word] <= cutoff:
                pass  # stuck here: how do I drop this word from the row?
    return df

My whole approach might be off. Performance is a huge issue: the df has about 350k rows, and each list in the "Body" column can contain anywhere from a few hundred to a few thousand words. I store the data in a pandas df rather than in plain lists because I want to keep the ID column, so that I can later connect the data to other analysis I've already done on the ads.

CodePudding user response:

IIUC, try:

use explode to split each list into individual rows, one word per row
use groupby and transform to count how often each keyword occurs in the DataFrame, and keep only the rows whose count is greater than the cutoff
use groupby and agg to collect the remaining words back into one list per ID, restoring the original DataFrame structure.

cutoff = 1

exploded = df.explode("Body")
counts = exploded.groupby("Body")["ID"].transform("size")
output = exploded.loc[counts.gt(cutoff)].groupby("ID").agg(list)

>>> output
                         Body
ID                           
(1483785165, 2009)  [#, #, #]

Note: In your example "#" is the only "word" that occurs more than once.
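
Since performance matters at about 350k rows, a plain-Python variant may be worth timing against the explode approach. What follows is only a minimal sketch, not a benchmarked solution. It keeps the function name and signature from the question, counts terms with collections.Counter (an assumption standing in for the existing frequency dictionary), and filters each list with a comprehension. Unlike the groupby version, it keeps IDs whose lists end up empty.

```python
from collections import Counter
from itertools import chain

import pandas as pd

adset = {"ID": ["(1483785165, 2009)", "(1538280431, 2010)", "(1795044103, 2010)"],
         "Body": [['price', '#', 'bedrooms', '#', 'bathrooms', '#', 'garage'],
                  ['cindy', 'lavender', 'mid', 'state', 'realty'],
                  ['upgrades', 'galore', 'perfectly', 'maintained', 'home', 'formals']]}
df = pd.DataFrame(adset)

# Count every term across all rows in one pass (stand-in for the existing dictionary).
term_freq = Counter(chain.from_iterable(df["Body"]))


def remove_sparse_words_from_df(df, term_freq, cutoff=1):
    # Build the set of terms to keep once, so each membership test is a set lookup.
    keep = {word for word, count in term_freq.items() if count > cutoff}
    result = df.copy()
    # Rebuild each list with only the kept terms; the original lists are left untouched.
    result["Body"] = [[word for word in row if word in keep] for row in df["Body"]]
    return result


cleaned = remove_sparse_words_from_df(df, term_freq, cutoff=1)
```

The ID column is carried along unchanged, so the result can still be joined to other analyses on ID.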